Abstract

Consider the construction of an object composed of m parts by distributing n
units to those parts. For example, say we are assigning n balls to m boxes. Each
assignment results in a certain count vector, specifying the number of balls allocated
to each box. If only assignments satisfying a set of constraints that are linear in these
counts are allowable, and m is fixed while n increases, most assignments that satisfy the
constraints result in frequency vectors (normalized counts) whose entropy approaches
that of the maximum entropy vector satisfying the constraints. This phenomenon of
"entropy concentration" is known in various forms, and is one of the justifications of the
maximum entropy method, one of the most powerful tools for solving problems with
incomplete information. The appeal of entropy concentration comes from the simplicity
of the argument: it is based purely on counting and does not need probabilities.
Existing proofs of the concentration phenomenon are based on limits or asymptotics.
Here we present non-asymptotic, explicit lower bounds on n for a number of variants
of the concentration result to hold to any prescribed accuracies, taking into account
the fact that allocations of discrete units can satisfy constraints only approximately.
The results are illustrated with examples on die tossing, vehicle or network traffic, and
the probability distribution of the length of a G/G/1 queue.
1 Introduction

Consider a process which is repeated n times and each repetition has m possible outcomes. 
For concreteness we may think of assigning n balls to m labelled boxes, where each box can 
hold any number of balls. The first ball can go into any box, the second ball can go into 
any box, ..., and the nth ball can go into any box. Each assignment or allocation is thus a 
sequence of n box labels and results in some number $\nu_1$ of balls in box 1, $\nu_2$ in box 2, etc.,
where the $\nu_i$ are $\geq 0$ and sum to $n$. There are $m^n$ possible assignments in all, and many
of them can lead to the same count vector $\nu = (\nu_1, \ldots, \nu_m)$. We refer to these assignments
as the realizations of the count vector.

The arrangement of n balls into m boxes can represent the construction of any object 
consisting of m distinguishable parts out of n identical units. So if the balls represent 
pixels of an image, the attributes of color and (suitably discretized) intensity are ascribed 
to the boxes to which the pixels are assigned. Then the count vector is thought of as a 2- 
dimensional matrix with rows labelled by intensity and columns by color. Other examples 
are people categorized by age, height, and weight, vehicles classified by weight, size, and fuel 
economy, packets in a communications network with attributes of origin, destination, size, 
and timestamp, etc. The object can even be a (discrete) probability distribution. When the 
process simply represents the classification of n units by m discrete or discretized attributes,
it is known as a (multi-dimensional) contingency table. 

Now consider imposing constraints C on the allowable assignments, expressed as a set of
linear relations on the elements of the frequency vector $f = (\nu_1/n, \ldots, \nu_m/n)$ corresponding
to the counts $\nu$, e.g. $5f_1 - 17.4f_2 \leq 0.131$, $f_{12} \leq f_{15}$, etc. As n grows, the frequency
vectors of more and more of the assignments that satisfy the constraints will have entropy
closer and closer to that of a particular m-vector $\varphi^*$, the vector of maximum entropy $H^*$
subject to the constraints C. (We denote this vector by $\varphi^*$, as opposed to $f^*$, to emphasize
that its entries are, in general, not rational.) This result is known, more or less, in many
forms^1: the original is E.T. Jaynes's "entropy concentration theorem" [Jay82], [Jay83],
in the information theory literature it is the "conditional limit theorem" [CT91], and in
computer science there is "strong entropy concentration" [Grü01], [Grü08]. All these results
involve limits or asymptotics in one way or another, i.e. in the statement

EC: given an $\varepsilon > 0$ and an $\eta > 0$, there is an $N(\varepsilon, \eta)$ such that for all
    $n \geq N(\varepsilon, \eta)$, the fraction of assignments that satisfy C and have a frequency
    vector with entropy within $\eta$ of $H^*$ is at least $1 - \varepsilon$.

one or more of the quantities $\varepsilon$, $\eta$, or $N$ is not given explicitly. For example, it is
well-known that the fraction of assignments that don't satisfy C is $O(e^{-cn})$. Note that
EC is simply a problem of counting; there is no uncertainty, no randomness, and there
are no probabilities anywhere. (However, the results can be applied to the derivation of
probability distributions.) Our purpose in this paper is to derive explicit expressions for
$N(\varepsilon, \eta)$ assuming that the maximum entropy vector $\varphi^* \in \mathbb{R}^m$ and its entropy $H^*$ are
known^2. Given a concrete problem with incomplete information, these expressions allow
us to assess the "reliability" of the MaxEnt solution to it, as we illustrate in §4. We also
establish a number of new results, as detailed at the end of this section.

^1 When we say "more or less" and "in many forms" we mean that similar statements are made about
similar or the same things, but it is not our purpose here to enter into a detailed comparison.

^2 "Explicit" means that there is not a single $O$, not even a $\Theta$, and much less an $\Omega$ to be found in the
whole paper.

Before proceeding, we give a very simple illustration. Consider assigning 5 balls to 3
boxes labelled A,B,C without any constraints at all (other than the fact that all balls
must be assigned, i.e. the frequencies must add up to 1). There are $3^5 = 243$ possible
assignments, e.g. A,A,B,A,C, meaning that the first two balls go into box A, the third
into box B, the fourth into A again, and the fifth into box C. Table 1.1 lists the possible
box occupancies or count vectors, and the number of realizations of each count vector,
denoted by $\#\nu$, i.e. the number of assignments that result in this vector. These numbers
are given by multinomial coefficients, e.g. $\#(3,0,2) = \binom{5}{3,0,2} = 10$. Finally the table gives
the entropy $H(f) = -\sum_i f_i \ln f_i$ of the frequency vector $f = \nu/5$ corresponding to each
count vector $\nu$. Both Table 1.1 and its graphical counterpart, Fig. 1.1, show the beginnings
of the concentration phenomenon even in this very small case.



count vector ν    #ν    H(f)
5,0,0              1    0
4,0,1              5    0.500
3,0,2             10    0.673
2,0,3             10    0.673
1,0,4              5    0.500
0,0,5              1    0
4,1,0              5    0.500
3,1,1             20    0.950
2,1,2             30    1.055
1,1,3             20    0.950
0,1,4              5    0.500
3,2,0             10    0.673
2,2,1             30    1.055
1,2,2             30    1.055
0,2,3             10    0.673
2,3,0             10    0.673
1,3,1             20    0.950
0,3,2             10    0.673
1,4,0              5    0.500
0,4,1              5    0.500
0,5,0              1    0

Table 1.1: m = 3, n = 5. The $3^5 = 243$ realizations/assignments exhibit rudimentary
entropy concentration: 150 of them have frequency vectors with entropy within 23% of
$H^* = \ln 3 = 1.099$.
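The counts in this small example can be reproduced by brute force. The following is a minimal sketch, not part of the original development: the function names `entropy` and `realizations` are illustrative choices, and the enumeration is only feasible for very small n and m.

```python
from collections import Counter
from itertools import product
from math import log

def entropy(freqs):
    """Entropy -sum f ln f of a frequency vector, with 0 ln 0 taken as 0."""
    return sum(-f * log(f) for f in freqs if f > 0)

def realizations(n, m):
    """Map each count vector to its number of realizations, by listing all m**n assignments."""
    counts = Counter()
    for assignment in product(range(m), repeat=n):  # one box label per ball
        counts[tuple(assignment.count(box) for box in range(m))] += 1
    return counts

n, m = 5, 3
table = realizations(n, m)

# Group the assignments by the (rounded) entropy of their frequency vector.
by_entropy = Counter()
for nu, num in table.items():
    by_entropy[round(entropy([c / n for c in nu]), 3)] += num

for h in sorted(by_entropy, reverse=True):
    print(f"H = {h:.3f}: {by_entropy[h]} assignments")
```

Grouping by entropy rather than by count vector makes the concentration visible directly: the two largest entropy values already account for 150 of the 243 assignments.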



The main point of this small example is to re-emphasize that statement EC has to do
simply with counting, and nothing to do with any probability considerations. Nevertheless,
the reader might still think that we are simply avoiding the introduction of a uniform p.d.
on the set of all $m^n$ possible assignments, and that without the assumption that all these
assignments or outcomes are equally likely, the concentration statement EC really has little
"practical" significance^3. In fact, quite the opposite is true: the phenomenon of entropy
concentration justifies the assumption of uniformity in the absence of any other knowledge,
i.e. Laplace's famous "principle of indifference" or "principle of insufficient reason"! Indeed,
in the absence of any constraints other than that the frequencies must sum to 1, entropy
concentration shows that the uniform frequency distribution is simply the one that can be
realized in the greatest number of ways, or most likely to be realized ([Jay82], §IV), and
is therefore to be preferred; we see indications of this even in our small example. We have
more to say on this matter in §3.4.

^3 Even the author has occasionally fallen prey to this ingrained viewpoint.

Figure 1.1: Graphical representation of the data of Table 1.1: assignments → count/frequency
vectors → entropies (243 assignments, 21 frequency vectors, 5 entropy values). The 6 frequency
vectors with the two highest entropies have more than half of all the realizations, and the
most likely vectors to be realized are the ones closest to the uniform (1/3, 1/3, 1/3).

In the following development we will take the dimension m of the problem to be given
and fixed, and concern ourselves solely with n. For a given n, we denote the set of all
n-frequency vectors by $F_n$, i.e. $F_n = \{(f_1, \ldots, f_m) \mid f_i = \nu_i/n,\ \nu_1 + \cdots + \nu_m = n\}$. We
represent the constraints C on the frequency vector $f$ or count vector $\nu$ by

$$Af \leq b \iff A\nu \leq bn, \qquad f \in F_n,\ \nu \in \mathbb{N}^m, \tag{1.1}$$



where the m-column matrix A and vector b are real constants, independent of n. (At
this level of generality we think of equalities as represented by pairs of inequalities; more
detail is introduced in §2.1.) Such inequality constraints are quite expressive: as just one
example we mention [Tho79], where inequalities are used to represent the limited
information/uncertainty concerning oil spill scenarios.

There is a long-standing issue with the MaxEnt method having to do with "expectations
vs. measured values" in constraints. Jaynes has offered a resolution, in [Jay03] §11.8
among other places, but our purely discrete formulation, entirely devoid of probabilities,
and with the constraints expressed by (1.1), eliminates this issue in its entirety.

Now treating the frequencies as reals instead of rationals, we assume that the constraints
(1.1) are satisfiable. Then they define a (non-empty) polytope in $\mathbb{R}^m$, and maximizing the
entropy

$$H(x) = -\sum_i x_i \ln x_i \quad \text{subject to} \quad A'x \leq b', \tag{1.2}$$

where A', b' are A, b augmented with $\sum_i x_i = 1$ and $x_i \geq 0$, is a strictly concave
maximization problem (see e.g. [ADSZ88]) which has a unique solution $\varphi^* \in \mathbb{R}^m$ with maximum
entropy $H^*$. The elements of $\varphi^*$ are non-negative reals, with values independent of n.

In the discrete problem however, some care is required in connection with equalities in
the constraints (1.1), whether they are explicit or implied by inequalities. For example,
suppose one constraint is $f_1 + f_2 = 1/139$. This is not satisfiable unless n is a multiple of
139, rendering the statement EC above impossible. When n is large however, in many cases
it is perfectly acceptable if the equalities are simply satisfied to a good approximation; in
fact, the same goes for the inequalities. For this reason we will assume that the constraints
(1.1) need to be satisfied only approximately, to within a tolerance $\delta > 0$. Besides being
necessary for a rigorous development, this tolerance may also be regarded as reflecting
some uncertainty in the exact values of A and b.
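The divisibility obstruction in the example can be checked exactly. The sketch below is illustrative only (the helper name `best_error` is hypothetical): since $f_1 + f_2 = k/n$ for some integer count k, it computes the smallest achievable deviation $|k/n - 1/139|$ using exact rational arithmetic.

```python
from fractions import Fraction

def best_error(n, target):
    """Smallest |k/n - target| over integer counts k, computed exactly."""
    k = round(n * target)            # nearest achievable count
    return abs(Fraction(k, n) - target)

target = Fraction(1, 139)
for n in (139, 278, 1000, 10000):
    print(n, best_error(n, target))
```

The deviation is zero only when n is a multiple of 139; for other n it is positive, which is why a tolerance on the constraints is needed.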

In addition to introducing a tolerance in the constraints, we will also develop a more
intuitive variant of the concentration result EC, around the MaxEnt vector $\varphi^*$ instead of
the MaxEnt value $H^*$. So we will be establishing a modified and more general version of
statement EC:

EC': given positive tolerances $\delta$, $\varepsilon$, and $\eta$ or $\vartheta$, as described in Table 1.2, there is
     an $N(\delta, \varepsilon, \eta)$ or $N(\delta, \varepsilon, \vartheta)$ such that for all $n \geq N$, the fraction of assignments
     that satisfy C to accuracy $\delta$ and have a frequency vector with entropy within
     $\eta$ of $H^*$ or no farther than $\vartheta$ in norm from $\varphi^*$, is at least $1 - \varepsilon$.

A simpler way of saying this is that as the size n of the problem increases, there is a
count vector $\nu^*$

1. whose corresponding frequency vector $f^*$ is arbitrarily close to $\varphi^*$, and satisfies the
constraints to any prescribed accuracy, and

2. such that, out of all the assignments that satisfy the constraints to this accuracy, the
fraction that realize a vector as close as desired to $f^*$, either in entropy or in norm, is
arbitrarily near 1.

This way of expressing the concentration result smacks of asymptotics, but we keep the
more precise statement EC' in mind.



$\delta$          relative tolerance in satisfying the constraints
$\varepsilon$     concentration tolerance, on number of realizations
$\eta$            relative tolerance in deviation from the MaxEnt value $H^*$
$\vartheta$       absolute tolerance in deviation from the MaxEnt vector $\varphi^*$

Table 1.2: Tolerances for the entropy concentration results.

In addition to EC', we will prove a more forceful variant which refers solely to the
realizations of the vector $f^*$ itself, as opposed to those of a whole set of vectors close to it:

EC'': given positive tolerances $\delta$, $\varepsilon$, and $\eta$ or $\vartheta$, as described in Table 1.2, there is an
      $N(\delta, \varepsilon, \eta)$ or $N(\delta, \varepsilon, \vartheta)$ such that for all $n \geq N$, the vector $f^* \in F_n$ has more
      than $1/\varepsilon$ times the realizations of the whole set of vectors that satisfy C to
      accuracy $\delta$ but have entropy not within $\eta$ of $H^*$ or are farther than $\vartheta$ in norm
      from $\varphi^*$.

Summary  In §2 we go into more detail on the various tolerances, in particular $\delta$, which
relates the exact solution^4 of the continuous maximum entropy problem to the approximate
solution of the discrete counting problem. The main idea is that for a given n we obtain an
optimal count vector $\nu^*$ by rounding and adjusting the vector $n\varphi^*$. We then show for each
of the desired properties how large n must be for $\nu^*$ and the frequency vector $f^* = \nu^*/n$
to have this property. These results are put together in §3, where we establish statement
EC'. In §3.1 we prove EC' with a tolerance $\eta$ on the deviation from the maximum entropy
value, and in §3.2 we discuss how our bound compares with the well-known asymptotic
result of E. T. Jaynes. We derive the result EC'' on the vector $f^*$ itself in §3.2.2. In §3.3
we establish EC' and EC'' with the more intuitive tolerance $\vartheta$ on the norm of deviation
from the maximum entropy vector. We use this norm version in §3.4 to point out that our
concentration results also apply to the derivation of discrete probability distributions by the
method of maximum entropy. Thus the exact, non-asymptotic concentration phenomenon
is a very powerful justification for the most common use of MaxEnt. The other major
justification is various axiomatic formulations, but the simple statements of the concentration
results and the purely combinatorial character of their derivation have a force of their
own^5. In §3.4 we also give a further elaboration that provides a quantitative justification
of the principle of indifference or insufficient reason.

The expressions for $N(\delta, \varepsilon, \eta)$ or $N(\delta, \varepsilon, \vartheta)$ use the solution $\varphi^*$ to the maximum entropy
problem, the value $H^*$ of the maximum entropy, and norms of the matrix A and vector
b defining the constraints. A main point of the paper is the computations made possible
by the bounds. Therefore in §4 we give three examples with detailed numerical results on
the lower bound N, one using the classic die-tossing experiment, one involving a vehicle
or network traffic problem, and one having to do with a simple queue. We also discuss
the relationship to Sanov's bound (from information theory), and the exact computation
of the number of count vectors satisfying the constraints.

The proofs of all of our results are given in the Appendix, so as not to disrupt the flow 
of the exposition. 



^4 Sometimes this solution may be analytical, and if it is numerical we assume it is sufficiently accurate
to be called "exact".

^5 The maximum entropy method itself is not the subject of this paper. There is an extensive body of
work on it: for example the works [Ros83], [Jay03] of E. T. Jaynes, the books [Tri69], [KK92], the series of
MaxEnt conference proceedings [ME98] and [ME99], etc.



2 Basic results: tolerances 

We define the rounding of a positive real number x to an integer [x] in the usual way, so
that it satisfies $|x - [x]| \leq 1/2$. Given an $n \in \mathbb{N}$, from the MaxEnt vector $\varphi^*$ we derive a
count vector $\nu^*$ and a frequency vector $f^*$ by a process of rounding and adjusting:

Definition 2.1 Given $\varphi^*$ and $n \geq m$, let $\nu = [n\varphi^*]$ and set $d = \sum_i \nu_i - n$. If $d = 0$, let
$\nu^* = \nu$. Otherwise, if $d < 0$, add 1 to $|d|$ elements of $\nu$ that were rounded down, and if
$d > 0$, subtract 1 from $|d|$ elements that were rounded up. Let the resulting vector be $\nu^*$,
and define $f^* = \nu^*/n$, $f^* \in F_n$.

Unlike $\varphi^*$, both of the vectors $\nu^*$ and $f^*$ depend on n, but we will not indicate this
explicitly to avoid burdensome notation. The adjustment of $\nu$ in Definition 2.1 ensures
that the result $\nu^*$ indeed sums to n, so $f^*$ is a proper frequency vector. (This adjustment
is always possible because if $d \neq 0$, there must be at least $|d|$ elements of $n\varphi^*$ that were
rounded to their floors if $d < 0$, or to their ceilings if $d > 0$. And $|d| \leq \lfloor m/2 \rfloor$ by the
definition of rounding.)
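The rounding-and-adjusting process of Definition 2.1 can be sketched as follows. This is only one possible realization: the definition does not say which of the eligible elements to adjust, so the choice made here (those with the largest rounding error first) is an assumption, as is the function name `round_and_adjust`.

```python
from math import floor

def round_and_adjust(phi, n):
    """Return a count vector summing to n, obtained from n*phi by rounding and adjusting."""
    scaled = [n * p for p in phi]
    nu = [floor(x + 0.5) for x in scaled]      # [x] with |x - [x]| <= 1/2
    d = sum(nu) - n
    if d < 0:
        # Candidates: entries that were rounded down; largest shortfall first (assumed tie rule).
        order = sorted((i for i in range(len(nu)) if nu[i] < scaled[i]),
                       key=lambda i: nu[i] - scaled[i])
        for i in order[:-d]:
            nu[i] += 1
    elif d > 0:
        # Candidates: entries that were rounded up; largest excess first (assumed tie rule).
        order = sorted((i for i in range(len(nu)) if nu[i] > scaled[i]),
                       key=lambda i: scaled[i] - nu[i])
        for i in order[:d]:
            nu[i] -= 1
    return nu

print(round_and_adjust([1 / 3, 1 / 3, 1 / 3], 5))
print(round_and_adjust([0.45, 0.45, 0.1], 3))
```

In the first call all three entries of $n\varphi^*$ round up and one must be decremented; in the second, rounding loses one unit and an entry that was rounded down is incremented.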

The fundamental observation is that when n is large enough, $f^*$ is arbitrarily close to
$\varphi^*$:

Proposition 2.1 Given any $\gamma > 0$, the frequency vector $f^*$ is s.t.

where $\mu^*$ is the number of non-zero elements of $f^*$ (and $\varphi^*$).

Recall that the $\ell_1$ norm is the sum of the absolute values, whereas the $\ell_\infty$ norm is the
maximum of the absolute values ([HJ90], §5.5).

The MaxEnt vector $\varphi^*$ satisfies the constraints (1.1) exactly. Now we show how $f^*$
satisfies them approximately, and how $\nu^*$ satisfies the scaled constraints approximately.

2.1 Constraints on frequency vectors 

All constraints on the frequency vectors can be expressed by inequalities as in (1.1). An
equality, e.g. $5f_1 + 3f_2 - f_3 = 0.34$, can be formulated as two inequalities, $5f_1 + 3f_2 - f_3 \leq 0.34$
and $-(5f_1 + 3f_2 - f_3) \leq -0.34$. In practice however, we may, for example,
consider equalities to be more important than inequalities, and may want to assign different
tolerances to them. Further, if we want to use tolerances that are relative to the magnitudes
of the elements of b, the presence of zeros requires special treatment. For these reasons
we will separate the constraints (1.1) into four categories: equalities with non-zeros on the
r.h.s., inequalities with non-zeros on the r.h.s., equalities with zeros, and inequalities with
zeros.




We represent the first category using a matrix $A^{=}$ and vector $b^{=}$ as $A^{=}x = b^{=}$, where
all elements of $b^{=}$ are non-zero. We want $f^*$ to satisfy the equality constraints with a
maximum error which is no more than a constant $\delta^{=} > 0$ times the smallest element of $b^{=}$
in absolute value. So we require

$$A^{=} f^* = b^{=} + \beta \quad \text{with} \quad \|\beta\|_\infty \leq \delta^{=}\,|b^{=}|_{\min}. \tag{2.1}$$

Similarly, formulating the inequalities with non-zeros as $A^{\leq}x \leq b^{\leq}$, we require

$$A^{\leq} f^* \leq b^{\leq} + \beta \quad \text{with} \quad \|\beta\|_\infty \leq \delta^{\leq}\,|b^{\leq}|_{\min}. \tag{2.2}$$

(The $\beta$s in (2.1) and (2.2) are different, but we don't want to complicate the notation.)
Coming to the constraints with 0s on the r.h.s., e.g. $f_2 = f_3$, $f_1 \leq f_2 + f_3$, etc., we require

$$A^{=0} f^* = \zeta \quad \text{with} \quad \|\zeta\|_\infty \leq \delta^{=0} \tag{2.3}$$

and

$$A^{\leq 0} f^* \leq \zeta \quad \text{with} \quad \|\zeta\|_\infty \leq \delta^{\leq 0} \tag{2.4}$$

for some positive $\delta^{=0}$ and $\delta^{\leq 0}$.

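$\varphi^*$ satisfies all of the constraints exactly. So any $f \in F_n$ close enough to $\varphi^*$ should
satisfy (2.1) to (2.4) for any positive tolerances. Indeed, using the abbreviation
$\delta = (\delta^{=}, \delta^{\leq}, \delta^{=0}, \delta^{\leq 0})$,

Proposition 2.2 Given any $\delta > 0$, set

$$\vartheta_\infty = \min\left( \frac{\delta^{=}\,|b^{=}|_{\min}}{|\!|\!|A^{=}|\!|\!|_\infty},\ \frac{\delta^{\leq}\,|b^{\leq}|_{\min}}{|\!|\!|A^{\leq}|\!|\!|_\infty},\ \frac{\delta^{=0}}{|\!|\!|A^{=0}|\!|\!|_\infty},\ \frac{\delta^{\leq 0}}{|\!|\!|A^{\leq 0}|\!|\!|_\infty} \right),$$

or $\infty$ if there are no constraints. Then any $f \in F_n$ such that $\|f - \varphi^*\|_\infty \leq \vartheta_\infty$ satisfies
(2.1), (2.2), (2.3), and (2.4).

Recall that the infinity norm $|\!|\!|\cdot|\!|\!|_\infty$ of a matrix is the maximum of the $\ell_1$ norms of the
rows. By Proposition 2.1, $f^*$ satisfies Proposition 2.2 if $n \geq 1/\vartheta_\infty$.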
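As a rough illustration of how such a tolerance might be computed, the sketch below assumes that $\vartheta_\infty$ is the minimum, over the constraint categories present, of the allowed error divided by the infinity norm of the corresponding matrix (allowed error meaning $\delta^{=}|b^{=}|_{\min}$ or $\delta^{\leq}|b^{\leq}|_{\min}$ for non-zero right-hand sides, and $\delta^{=0}$ or $\delta^{\leq 0}$ for zero ones). That reading, the function names, and the numerical tolerances are assumptions made for the example.

```python
from math import ceil, inf

def matrix_inf_norm(A):
    """Maximum absolute row sum of a matrix given as a list of rows."""
    return max(sum(abs(a) for a in row) for row in A)

def theta_inf(groups):
    """groups: list of (matrix, allowed_error) pairs, one per constraint category present."""
    bounds = [err / matrix_inf_norm(A) for A, err in groups if A]
    return min(bounds) if bounds else inf

# Hypothetical instance: the equality 5 f1 + 3 f2 - f3 = 0.34 with relative tolerance 0.01,
# and the zero-r.h.s. inequality f1 - f2 <= 0 with absolute tolerance 0.001.
groups = [([[5, 3, -1]], 0.01 * 0.34),
          ([[1, -1, 0]], 0.001)]
theta = theta_inf(groups)
print(theta, ceil(1 / theta))   # tolerance, and the n from which f* is within it
```

Under these assumptions the equality is the binding constraint, because its row has the larger $\ell_1$ norm relative to its allowed error.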

2.2 Entropy 

Turning now to entropy, we point out that if a frequency vector $f$ is close enough to $\varphi^*$,
its entropy can be as close to $H^*$ as desired:

Proposition 2.3 For any $\gamma > 0$, if $f$ is s.t. $\|f - \varphi^*\|_\infty \leq \gamma$, then $H^* - H(f) \leq m\gamma \ln(1/\gamma)$.

It follows that

Proposition 2.4 Given an entropy tolerance $\eta > 0$ and $\eta \leq m/(21H^*)$, if $f$ is s.t.

$$\|f - \varphi^*\|_\infty \leq \frac{2}{3}\,\frac{\eta H^*}{\ln(m/(\eta H^*))},$$

then $(1 - \eta)H^* \leq H(f) \leq H^*$.

(The condition $\eta \leq m/(21H^*)$ is not much of a restriction, and is explained in the
proof.) In view of Proposition 2.1, $f^*$ will satisfy Proposition 2.4 when n is large enough.
Proposition 2.4 is used to establish Lemma 3.2 in §3.1.

3 Concentration 

We establish the concentration result EC' stated in the Introduction: in §3.1 we prove
the first version, expressed in terms of deviation from the maximum entropy value $H^*$,
and in §3.3 we prove the second version, phrased in terms of deviation from the MaxEnt
vector $\varphi^*$. We also establish the statement EC'' in its two versions. To avoid cumbersome
notation in what follows, we denote the tolerances on the constraints collectively by
$\delta = (\delta^{=}, \delta^{\leq}, \delta^{=0}, \delta^{\leq 0})$.

3.1 Maximum entropy value 

Let $C(\delta)$ be the set of m-vectors that satisfy the constraints (2.1) to (2.4) to accuracy $\delta$:

$$C(\delta) = \{x \in \mathbb{R}^m \mid x \text{ satisfies (2.1) to (2.4) with } \delta = (\delta^{=}, \delta^{\leq}, \delta^{=0}, \delta^{\leq 0})\}. \tag{3.1}$$

Now given an $\eta > 0$, consider the following two sets of frequency vectors^6. $A_n(\delta, \eta)$ is the
set of vectors in $F_n$ that lie in $C(\delta)$ and have entropy at least $(1 - \eta)H^*$:

$$A_n(\delta, \eta) = \{f \in F_n \cap C(\delta),\ H(f) \geq (1 - \eta)H^*\}. \tag{3.2}$$

$B_n(\delta, \eta)$ is the complementary set of frequency vectors, i.e. those in $C(\delta)$ but with entropy
less than $(1 - \eta)H^*$:

$$B_n(\delta, \eta) = \{f \in F_n \cap C(\delta),\ H(f) < (1 - \eta)H^*\}. \tag{3.3}$$

Clearly, $F_n \cap C(\delta) = A_n(\delta, \eta) \cup B_n(\delta, \eta)$ irrespective of the values of $\delta$ and $\eta$.

^6 Our notation is similar to that of [CT91], §12.6. The set $A_n$ is not to be confused with the matrix A
of (1.1).

The number of realizations $\#f$ of a frequency vector $f$ is related to its entropy $H(f)$.
A simple result is Lemmas II.1 and II.2 of [Csi99],

$$\forall f \in F_n, \qquad \binom{n+m-1}{m-1}^{-1} e^{nH(f)} \leq \#f \leq e^{nH(f)}, \tag{3.4}$$

but a much more precise result is



Proposition 3.1 Given $f \in F_n$, let $f_1, \ldots, f_\mu$, $\mu \geq 1$, be its non-zero elements ($\#f$ does
not change when $f$ is permuted). Define

$$S(f, n) = \frac{1}{(2\pi n)^{(\mu-1)/2}\,\sqrt{f_1 \cdots f_\mu}}.$$

Then $\#f$ is bounded as follows:

$$e^{-\mu/12}\, S(f, n)\, e^{nH(f)} \leq \#f \leq S(f, n)\, e^{nH(f)}.$$

(The bounds hold even when $\mu = 1$ and $\#f = 1$.)
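The two kinds of bound on the number of realizations can be compared numerically. The sketch below is a check under stated assumptions, not the paper's computation: it takes the simple bound to be $\binom{n+m-1}{m-1}^{-1} e^{nH(f)} \leq \#f \leq e^{nH(f)}$, and the sharper one to be the Stirling-type form $e^{-\mu/12}\,S\,e^{nH(f)} \leq \#f \leq S\,e^{nH(f)}$ with $S = (2\pi n)^{-(\mu-1)/2} / \sqrt{f_1 \cdots f_\mu}$. All function names are illustrative.

```python
from math import comb, exp, factorial, log, pi, prod, sqrt

def compositions(n, m):
    """Yield all count vectors of length m with non-negative entries summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest

def num_realizations(nu):
    """Multinomial coefficient n! / (nu_1! ... nu_m!)."""
    return factorial(sum(nu)) // prod(factorial(c) for c in nu)

def check_bounds(n, m):
    """Assert both pairs of bounds for every count vector; return how many were checked."""
    checked = 0
    for nu in compositions(n, m):
        nonzero = [c / n for c in nu if c > 0]
        mu = len(nonzero)
        peak = exp(n * sum(-f * log(f) for f in nonzero))       # e^{n H(f)}
        s = (2 * pi * n) ** (-(mu - 1) / 2) / sqrt(prod(nonzero))
        count = num_realizations(nu)
        slack = 1 + 1e-12                                        # guard against rounding
        assert peak / comb(n + m - 1, m - 1) <= count <= peak * slack
        assert exp(-mu / 12) * s * peak <= count * slack
        assert count <= s * peak * slack
        checked += 1
    return checked

print(check_bounds(5, 3))
```

For the count vector (2,2,1) with n = 5, for instance, the simple bounds bracket the true value 30 only between about 9.3 and 195, whereas the Stirling-type form narrows this to roughly 27 and 35.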

Using the bounds of Proposition 3.1 we will now show that given any $\varepsilon > 0$, there is a
number $N = N(\varepsilon)$ s.t. if $n \geq N$, then all but a fraction $\varepsilon$ of the realizations/assignments
that satisfy the constraints have frequencies in the set $A_n(\delta, \eta)$:

$$\frac{\#A_n(\delta, \eta)}{\#A_n(\delta, \eta) + \#B_n(\delta, \eta)} \geq 1 - \varepsilon. \tag{3.5}$$

The proof consists of deriving a lower bound on $\#A_n$ and an upper bound on $\#B_n$, taking
their ratio, and deriving a lower bound on n so as to ensure that the ratio is at least $1 + 1/\varepsilon$.
It is similar in spirit to the proof of the "conditional limit theorem", Theorem 12.6.2 of
[CT91].

First, the upper bound on $\#B_n$. Recall from the beginning of §3 that we are using the
abbreviated notation $\delta = (\delta^{=}, \delta^{\leq}, \delta^{=0}, \delta^{\leq 0})$.

Lemma 3.1 Given any $\delta, \eta > 0$,

$$\#B_n(\delta, \eta) < 4.004\,\sqrt{2\pi}\;0.6^m\, n^{(m-1)/2}\, e^{n(1-\eta)H^*},$$

where the numerical constants assume that $n \geq 100$.

This bound is independent of $\delta$. The use of Proposition 3.1 over the simpler (3.4) resulted
in $n^{(m-1)/2}$ instead of $n^m$.

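For our lower bound on $\#A_n$ we need an auxiliary lower bound, on the number of
frequency vectors that lie in an m-dimensional cube centered at $\varphi^*$ and of side $2\vartheta$:

Proposition 3.2 Let $\mu^* \geq 1$ be the number of non-zero elements of $\varphi^*$, $\varphi^*_{\max}$ its largest
element, and $\varphi^*_{\min}$ its smallest non-zero element. Let $\vartheta$ be a positive number s.t. $\vartheta \leq \varphi^*_{\max}$
and $\vartheta \leq (\mu^* - 1)\,\varphi^*_{\min}$. Then the set $\{f \in F_n \mid \|f - \varphi^*\|_\infty \leq \vartheta\}$ contains at least

$$\Lambda(n, \vartheta, \mu^*) = \left\lfloor n\vartheta \left( \frac{1}{m-1} + \frac{1}{\mu^*-1} \right) \right\rfloor^{\mu^*-1} \left\lfloor \frac{n\vartheta}{m-1} \right\rfloor^{m-\mu^*}$$

elements. If $\mu^* = 1$, the first factor in this expression and the second condition on $\vartheta$ are
absent.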
The two extreme cases, when all the elements of $\varphi^*$ are non-zero, and when only one is
non-zero, yield, respectively, $\lfloor 2n\vartheta/(m-1) \rfloor^{m-1}$ and $\lfloor n\vartheta/(m-1) \rfloor^{m-1}$. Fig. 3.1 illustrates
the difference in the case m = 2.

Figure 3.1: Illustration of Proposition 3.2 in two dimensions when $\mu^* = 2$ and when $\mu^* = 1$.

Now let $\alpha \in (0, 1)$ be a parameter, and define the number

$$\vartheta_0 = \min\left( \vartheta_\infty,\ \frac{2}{3}\,\frac{\alpha\eta H^*}{\ln(m/(\alpha\eta H^*))},\ \varphi^*_{\min} \right), \tag{3.6}$$

where $\vartheta_\infty$ has been specified in Proposition 2.2 and $\varphi^*_{\min}$ in Proposition 3.2. $\vartheta_0$ depends on
$\delta$, $\eta$ and $\alpha$, which is specified in Theorem 3.1 below. Our lower bound on $\#A_n$, the number
of realizations of $A_n$, is based on a lower bound on the size of $A_n$, obtained by using the
tolerance $\vartheta_0$ in Proposition 3.2.



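Lemma 3.2 Given any $\delta, \eta > 0$, and some $\alpha \in (0, 1)$, we have

$$|A_n(\delta, \eta)| \geq \Lambda(n, \vartheta_0, \mu^*)$$

and

$$\#A_n(\delta, \eta) \geq \Lambda(n, \vartheta_0, \mu^*)\,\sqrt{2\pi} \left( \frac{\mu^*}{2\pi} \right)^{\mu^*/2} e^{-\mu^*/12}\; n^{-(\mu^*-1)/2}\; e^{n(1-\alpha\eta)H^*}.$$

Typically $\mu^* = m$, and then $n \geq (m-1)/(2\vartheta_0)$ is necessary for $A_n(\delta, \eta)$ to not be
empty and for $\#A_n(\delta, \eta)$ to be at least 1. Otherwise we need $n \geq (m-1)/\vartheta_0$. There is
a savings similar to that in Lemma 3.1 due to the use of Proposition 3.1 over the simpler
(3.4).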
The main result following from Lemmas 3.1 and 3.2 depends on many parameters and
we have taken some care to reduce the slack in the bounds, so it reads more like the
specification of an algorithm rather than a theorem:

Theorem 3.1 Given any $\delta, \varepsilon, \eta > 0$, let $\alpha \in (0, 1)$ be a parameter whose value is specified
below. With $\mu^*$ the number of non-zero elements of $\varphi^*$, define the constants

$$C_1 = \frac{0.5(m + \mu^*) - 1}{(1 - \alpha)\eta H^*}, \qquad C_2 = \frac{m \ln 0.6 + (0.5 \ln 2\pi + 1/12 - 0.5 \ln \mu^*)\,\mu^* + \ln\bigl(4.004\,(1 + 1/\varepsilon)\bigr)}{(1 - \alpha)\eta H^*},$$

and set

$$N(\alpha) = \begin{cases} 1.5\, C_1 \ln(C_1 + C_2) + C_2, & \text{if } C_2 > 0 \text{ and } C_1 + C_2 \geq 21, \\ 1.5\, C_1 \ln C_1 + C_2, & \text{if } C_2 \leq 0. \end{cases}$$

Let $\alpha \in (0, 1)$ be the solution of the equation

$$N(\alpha) = \frac{m - 1}{[\![2,\ \mu^* = m]\!]\;\vartheta_0(\alpha)},$$

where the notation $[\![x, B]\!]$ yields x if boolean condition B holds and 1 otherwise, and $\vartheta_0$ is
given by (3.6). Finally set

$$N = \frac{m - 1}{[\![2,\ \mu^* = m]\!]}\,\max\left( \frac{3 \ln(m/(\alpha\eta H^*))}{2\alpha\eta H^*},\ \frac{1}{\vartheta_\infty},\ \frac{1}{\varphi^*_{\min}} \right),$$

where $\vartheta_\infty$ is specified in Proposition 2.2 and $\varphi^*_{\min}$ in Proposition 3.2. Then for all $n \geq N$
we have

$$\frac{\#\{f \in F_n \cap C(\delta),\ H(f) \geq (1 - \eta)H^*\}}{\#\{f \in F_n \cap C(\delta)\}} \geq 1 - \varepsilon.$$

Given any tolerances $\delta, \varepsilon, \eta$, the theorem shows how to calculate a number $N(\delta, \varepsilon, \eta)$
s.t. if $n \geq N(\delta, \varepsilon, \eta)$, then all but the fraction $\varepsilon$ of the assignments of the n objects to
the m boxes that satisfy the constraints to accuracy $\delta$ have entropy within a factor $1 - \eta$ of the
maximum. An analogue of this result, but phrased in terms of a deviation $\vartheta$ from the
maximum entropy vector $\varphi^*$, is given in §3.3.
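The role of a closed-form N(α) can be illustrated on the underlying inequality $n \geq C_1 \ln n + C_2$. The sketch below assumes the first case of the theorem reads $1.5\,C_1 \ln(C_1 + C_2) + C_2$ for $C_2 > 0$ and $C_1 + C_2 \geq 21$ (a reading of the damaged formula, not a confirmed quotation), and compares it with the smallest integer found by direct search. The values of $C_1$ and $C_2$ are arbitrary, and the function names are hypothetical.

```python
from math import ceil, log

def closed_form_bound(c1, c2):
    """1.5*C1*ln(C1 + C2) + C2, for the case C2 > 0 and C1 + C2 >= 21 (assumed reading)."""
    if not (c1 > 0 and c2 > 0 and c1 + c2 >= 21):
        raise ValueError("sketch covers only C1 > 0, C2 > 0, C1 + C2 >= 21")
    return 1.5 * c1 * log(c1 + c2) + c2

def satisfies(n, c1, c2):
    """True if n >= C1 ln n + C2."""
    return n >= c1 * log(n) + c2

def smallest_integer_solution(c1, c2):
    """Smallest integer n >= C1 satisfying the inequality; n - C1 ln n increases beyond C1,
    so the inequality then holds for all larger n as well."""
    n = max(1, ceil(c1))
    while not satisfies(n, c1, c2):
        n += 1
    return n

c1, c2 = 10.0, 50.0
print(closed_form_bound(c1, c2), smallest_integer_solution(c1, c2))
```

A closed form of this kind is sufficient rather than tight: for these values it gives about 111, while the search stops at 96.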



3.2 Discussion 

There are a few things to note in connection with Theorem 3.1:

1. The first term in the expression for N depends on the adjusted deviation $\Delta H = \alpha\eta H^*$
from the value of the maximum entropy, the second depends on the tolerances $\delta$ for
satisfying the constraints (§2.1), and the third depends on the smallest non-zero
element of the solution $\varphi^*$ to the maximum entropy problem. $\varepsilon$ is hiding in the value
of $\alpha$, see 5 below.

2. If the constraints (1.1) do not force any element of $\varphi^*$ to be 0, we will simply have
$\mu^* = m$.

3. Roughly speaking, N is at least $m/\vartheta_\infty$, at least $m/\varphi^*_{\min}$, and at least $(m/\Delta H)\ln(m/\Delta H)$
as well.

4. By examining the expressions for $C_1$ and $C_2$ it is clear that the value of $N(\alpha)$ is
sensitive to $\eta$, but not very sensitive to $\varepsilon$. This carries over to N, and is illustrated
numerically in §4.

5. The l.h.s. of the equation determining the parameter $\alpha$ depends on $\varepsilon$ and $\eta$, whereas
the r.h.s. depends only on $\delta$. The optimal value of $\alpha$ depends weakly on $\varepsilon$, and is
essentially a function of $\eta$. This is illustrated in §4.2.

6. Finally, the assumptions $n \geq 100$ in Lemma 3.1 and $C_1 + C_2 \geq 21$ in Theorem 3.1
are very easily satisfied in applications.

3.2.1 Comparison with the results of Jaynes 

We compare Theorem 3.1 with the original concentration theorem of E.T. Jaynes:

Theorem 3.2 ([Jay82], or [Jay83], Ch. 11) Suppose the constraints consist of $\ell$ linearly-
independent equalities, and set $\Delta H = H^* - H$. Then, as $n \to \infty$, $2n\Delta H = \chi^2_{m-\ell-1}(\varepsilon)$,
where $\chi^2_k$ is the chi-squared distribution with k degrees of freedom.

This says that

$$\varepsilon = \frac{1}{\Gamma(s+1)} \int_{n\Delta H}^{\infty} x^s e^{-x}\,dx, \tag{3.7}$$

where $s = (m - \ell - 1)/2 - 1$ and the r.h.s. represents the tail of the chi-squared density.
This tail is the normalized incomplete gamma function, with the asymptotic expansion
$\frac{1}{\Gamma(s+1)}\, e^{-n\Delta H} (n\Delta H)^s \bigl(1 + \frac{s}{n\Delta H} + \cdots\bigr)$ (see, e.g. [AS72], eq. 6.5.32). Thus ignoring $\ell$ (but
see §3.2.3), and retaining only the first term of the above series, it can be seen that (3.7)
requires $n\Delta H \geq s \ln(n\Delta H) - \ln(\varepsilon\Gamma(s+1))$. This translates to $n \geq C_1 \ln n + C_2$, where the
constants are

$$C_1 = \frac{s}{\Delta H}, \qquad C_2 = \frac{1}{\Delta H}\bigl(s \ln \Delta H - \ln(\varepsilon\Gamma(s+1))\bigr).$$

Comparing this with the N of Theorem 3.1 we see that there is qualitative agreement
between our exact bound and Jaynes's asymptotic result, and the $C_1$'s are similar. (In
fact, it follows from Lemma 3.2 and Proposition 3.2 that asymptotically our $C_1$ is better.)

3.2.2 The MaxEnt vector itself 

It may seem in some sense unsatisfactory that Theorem 3.1 says that an entire set of
vectors around the MaxEnt vector $\varphi^*$ is dominant. Indeed, using elements in the proof of
Theorem 3.1 it is possible to say something about just $f^*$ itself. The result can be stated
more simply than Theorem 3.1, holds for smaller n, and even shows that $f^*$ is closer than
$\eta$ to $\varphi^*$ in entropy:

Lemma 3.3 Given any $\delta, \varepsilon, \eta > 0$, let $\alpha \in (0, 1)$ be the solution of $N(\alpha) = 1/\vartheta_0(\alpha)$, with
$N(\alpha), \vartheta_0(\alpha)$ as in Theorem 3.1. Then if

$$n \geq \max\left( \frac{3 \ln(m/(\alpha\eta H^*))}{2\alpha\eta H^*},\ \frac{1}{\vartheta_\infty},\ \frac{1}{\varphi^*_{\min}} \right),$$

the frequency vector $f^*$ is such that

$$f^* \in A_n(\delta, \alpha\eta) \quad \text{and} \quad \frac{\#f^*}{\#\{f \in F_n \cap C(\delta),\ H(f) < (1 - \eta)H^*\}} > \frac{1}{\varepsilon}.$$

This is the statement EC'' in the Introduction. In simple terms, it says that for n
about m times smaller than what Theorem 3.1 requires, the MaxEnt frequency vector
$f^*$, whose entropy differs from $H^*$ by less than $\alpha\eta H^*$, has all by itself $1/\varepsilon$ times as many
realizations as the entire set of vectors that satisfy the constraints but have entropies not
within $\eta$ of $H^*$. See Fig. 3.2.

Figure 3.2: The sets $A_n(\delta, \eta)$, $B_n(\delta, \eta)$, and $A_n(\delta, \alpha\eta)$ for Lemma 3.3, drawn inside
$F_n \cap C(\delta)$; the innermost region is $H(f) \geq (1 - \alpha\eta)H^*$.

The closeness of the excluded vectors to $f^*$ is controlled by $\eta$ and can be made as tight
as desired. Nevertheless, we cannot exclude everything around $f^*$: the simplest
counter-example is n even, m = 2, and no constraints; then $\binom{n}{n/2} / \binom{n}{n/2 \pm 1} \to 1$ as n increases.
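The counter-example is easy to verify exactly. In the short sketch below (the function name `central_ratio` is illustrative) the ratio of the two binomial coefficients is computed as an exact fraction; it equals $(n/2 + 1)/(n/2) = 1 + 2/n$.

```python
from fractions import Fraction
from math import comb

def central_ratio(n):
    """Ratio of the realizations of (n/2, n/2) to those of (n/2 + 1, n/2 - 1), for even n."""
    if n % 2:
        raise ValueError("n must be even")
    return Fraction(comb(n, n // 2), comb(n, n // 2 + 1))

for n in (10, 100, 1000):
    print(n, central_ratio(n), float(central_ratio(n)))
```

So the immediate neighbours of the MaxEnt vector have almost as many realizations as the vector itself once n is large, which is why a set of nearby vectors has to be allowed.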

3.2.3 Improvements 

In perusing the various results and their proofs with an eye toward improvements, we notice
that if the constraints (1.1) include (linearly-independent) equalities, say $\ell$ of them, this
would reduce the dimension m of $F_n$ by $\ell$. Our results could then be re-worked using
a notation such as $F_{n,m-\ell}$, which makes the dimension explicit. We will not pursue this
improvement further here. Another possibility is to improve the bound of Proposition 2.4,
as noted in its proof. A shortcoming in the results on which Theorem 3.1 is based is that
the bound on $\#A_n$ is sensitive to $\delta$ but the bound on $\#B_n$ is not.

3.3 Maximum entropy vector 

The development in §3.1 used a tolerance $\eta$ in deviation from the maximum entropy value
$H^*$. Here we re-cast this development in terms of a more intuitive tolerance $\vartheta$ in deviation
from the maximum entropy vector $\varphi^*$. This formulation will be very useful when we deal
with probability distributions in §3.4. So, given a $\vartheta > 0$, we re-define the sets $A_n$ and $B_n$
of (3.2) and (3.3) as

$$A'_n(\delta, \vartheta) = \{f \in F_n \cap C(\delta),\ \|f - \varphi^*\|_1 \leq \vartheta\}, \qquad B'_n(\delta, \vartheta) = \{f \in F_n \cap C(\delta),\ \|f - \varphi^*\|_1 > \vartheta\}. \tag{3.8}$$

These sets form a partition of $F_n \cap C(\delta)$ for any $\delta$ and any suitably small $\vartheta$, as was the case
in §3.1.

To count the realizations of these sets we need a connection between differences in norm
and differences in entropy. If $f$ is close to $\varphi^*$ in norm, its entropy is close to $H^*$, and if it
is far from $\varphi^*$, its entropy cannot be too close to $H^*$:

Proposition 3.3 Given $0 < \vartheta \le 1/2$, then

$\|f-\varphi^*\|_1 \le \vartheta \ \Rightarrow\ H(f) \ge H^* - \vartheta\ln(m/\vartheta)$,
$\|f-\varphi^*\|_1 > \vartheta \ \Rightarrow\ H(f) < H^* - \vartheta^2/2$.
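As a sanity check of Proposition 3.3, the sketch below tests both implications numerically in the simplest setting. It assumes there are no constraints, so that $\varphi^*$ is uniform and $H^* = \ln m$; the sampling scheme and all function names are our own illustrative choices, not part of the paper.

```python
import math
import random

def entropy(f):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(x * math.log(x) for x in f if x > 0)

def l1_distance(f, g):
    return sum(abs(a - b) for a, b in zip(f, g))

def check_prop_3_3(m, theta, trials=2000, seed=0):
    """Return True if no sampled f violates either implication.

    Assumption: unconstrained case, phi* = (1/m, ..., 1/m), H* = ln m.
    """
    rng = random.Random(seed)
    phi = [1.0 / m] * m
    h_star = math.log(m)
    eps = 1e-12  # slack for floating-point round-off
    for _ in range(trials):
        # Random point of the simplex, mixed with phi* so that both
        # small and large distances from phi* are exercised.
        w = [rng.expovariate(1.0) for _ in range(m)]
        s = sum(w)
        t = rng.random()
        f = [(1 - t) * p + t * x / s for p, x in zip(phi, w)]
        d = l1_distance(f, phi)
        if d <= theta:
            if entropy(f) < h_star - theta * math.log(m / theta) - eps:
                return False
        else:
            if entropy(f) > h_star - theta ** 2 / 2 + eps:
                return False
    return True

if __name__ == "__main__":
    print(check_prop_3_3(m=6, theta=0.3))
```

In this unconstrained case the second implication is Pinsker's inequality applied to $H^* - H(f) = D(f\,\|\,\varphi^*)$.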

Now with the definitions (3.8) we have analogues of the results of §3.1, beginning with
an analogue of Lemma 3.1.

Lemma 3.4 Given any $\delta, \vartheta > 0$,

$$\#B'_n(\delta,\vartheta) < 4.004\,\sqrt{2\pi}\;0.6^m\, n^{(m-1)/2}\, e^{n(H^* - \vartheta^2/2)},$$

where the numerical constants assume that $n \ge 100$.

Next is an analogue of (3.6): we define

$\vartheta_0 = \min(\vartheta_\infty, \alpha\vartheta, \varphi^*_{\min})$, (3.9)

where $\alpha \in (0, 1)$ is a parameter on which we elaborate in Proposition 3.4 and Theorem 3.3
below. Finally, an analogue of Lemma 3.2 for the set $A'_n$:

Lemma 3.5 Given any $\delta, \vartheta > 0$, and some $\alpha \in (0, 1)$, we have
with A(-) defined in Proposition \3.S\. and 



where $h(\vartheta) = \vartheta\ln(m/\vartheta)$.

From these two lemmas, the ratio $\#A'_n/\#B'_n$ is bounded from below by an exponential
in $n$ of the form $e^{n\psi(\alpha,\vartheta)}$, divided by a polynomial in $n$. The coefficient of $n$ in the
exponential, the analogue of the $\Delta H$ of §3.2, is critical:

Proposition 3.4 Consider the function

For fixed $\vartheta$, $\psi(\alpha,\vartheta)$ is positive for $\alpha \le \vartheta^2/2$ and increases as $\alpha$ decreases. The equation
$\psi(\alpha,\vartheta) = 0$ has a root $\alpha_0 \in (\vartheta^2/2, 1)$.

(The condition on $m$ does not impose a significant restriction in practice; even for $\vartheta$ as
large as 0.04, it requires $m \le 2.3 \cdot 10^6$.)
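To make the role of $\alpha_0$ concrete, here is a small sketch that locates the root by bisection. It rests on an assumption: we take $\psi(\alpha,\vartheta) = \vartheta^2/2 - h(\alpha\vartheta)$ with $h(x) = x\ln(m/x)$, which is the form suggested by the exponents in Lemmas 3.4 and 3.5 and which reproduces the figure $m \le 2.3\cdot 10^6$ quoted above for $\vartheta = 0.04$. The function names are ours.

```python
import math

def psi(alpha, theta, m):
    """ASSUMED form of psi: theta^2/2 - h(alpha*theta), h(x) = x ln(m/x)."""
    x = alpha * theta
    return theta ** 2 / 2 - x * math.log(m / x)

def psi_root(theta, m, tol=1e-14):
    """Bisection for the root alpha_0 of psi(., theta) in (theta^2/2, 1)."""
    lo, hi = theta ** 2 / 2, 1.0
    if not (psi(lo, theta, m) > 0 > psi(hi, theta, m)):
        raise ValueError("no sign change on (theta^2/2, 1); m may be too large")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if psi(mid, theta, m) > 0:
            lo = mid      # psi decreases in alpha, so the root is to the right
        else:
            hi = mid
    return (lo + hi) / 2

def max_m(theta):
    """Largest m with psi(theta^2/2, theta) > 0 under the assumed form."""
    return 0.5 * theta ** 3 * math.exp(1 / theta)

if __name__ == "__main__":
    print(psi_root(theta=0.05, m=6))
    print(max_m(0.04))
```

Under this assumption, any $\alpha$ below the root gives a positive exponent $n\psi(\alpha,\vartheta)$, which is what the ratio bound needs.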

With the above, we finally have the analogue of Theorem 3.1. Again, the statement
is much more like that of an algorithm for computing $N$ and the main feature is that $H^*$
does not appear anywhere:

Theorem 3.3 Given any $\delta, \varepsilon > 0$ and $0 < \vartheta < 1/2$, assume that $m < \tfrac12\vartheta^3 e^{1/\vartheta}$. Define
the constants

$$C_1 = \frac{0.5(m + \mu^*) - 1}{\psi(\alpha,\vartheta)}, \qquad C_2 = \frac{m\ln 0.6 + (0.5\ln 2\pi + 1/12 - 0.5\ln\mu^*)\mu^* + \ln(\ldots)}{\psi(\alpha,\vartheta)}.$$



The numerators are the same as in Theorem 3.1 and the function $\psi(\alpha,\vartheta)$ is defined in
Proposition 3.4. Set

$$N(\alpha) = \begin{cases} 1.5\,C_1\ln(C_1 + C_2) + C_2, & \text{if } C_2 > 0 \text{ and } C_1 + C_2 > 21,\\ 1.5\,C_1\ln C_1 + C_2, & \text{if } C_2 \le 0,\end{cases}$$

as in Theorem 3.1. Let $\alpha_0 \in (\vartheta^2/2, 1)$ be the root of $\psi(\alpha,\vartheta) = 0$, and $\hat\alpha \in (0, \alpha_0)$ be the
solution of the equation

i^T/ \ m— 1 

N{a) 



where the notation [[.,.]] was defined in Theorem 3.1 and $\vartheta_0$ is given by (3.9). Finally set

771—1 



N = N{a) 



I2,fi* = mj min {a^, ^^, <f*^.J 



where $\vartheta_\infty$ has been specified in Proposition 2.2 and $\varphi^*_{\min}$ in Proposition 3.2. Then for all
$n \ge N$ we have

$$\frac{\#\{f \in F_n \cap C(\delta),\ \|f-\varphi^*\|_1 \le \vartheta\}}{\#\{f \in F_n \cap C(\delta)\}} \ge 1 - \varepsilon.$$

This is the desired result, using deviation in norm from the MaxEnt vector $\varphi^*$, instead
of difference in entropy from $H^*$. It says that the set of frequency vectors that are within
$\vartheta$ of $\varphi^*$ in $\ell_1$ norm has all but the fraction $\varepsilon$ of the realizations that satisfy the constraints
to the prescribed accuracy $\delta$.

Comments similar to those made on Theorem 3.1 apply here also, and in addition we
have a mild restriction on $m$. Further, recalling the traditional view of entropy
concentration in Fig. 1.1, because the norm is a more intuitive measure of closeness, Theorem 3.3
in effect says that concentration occurs earlier, at the "vector", instead of the "entropy"
stage.

We also have the analogue of Lemma 3.3 on the MaxEnt vector $f^*$ itself. Again,
its statement is simpler than that of Theorem 3.3, it holds for somewhat smaller $n$ when
$\mu^* < m$, and in fact it establishes that $f^*$ is closer than $\vartheta$ to $\varphi^*$ in norm:

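Lemma 3.6 Given any $\delta, \varepsilon > 0$ and $0 < \vartheta < 1/2$, let $m < \tfrac12\vartheta^3 e^{1/\vartheta}$. Let $\hat\alpha \in (0, 1)$ be
the solution of $N(\alpha) = 3\mu^*/(4\vartheta_0(\alpha))$, with $N(\alpha)$ and $\vartheta_0(\alpha)$ as in Theorem 3.3. Then if

$$n \ge \frac{3}{4}\,\mu^* \max\Big(\frac{1}{\hat\alpha\vartheta},\ \frac{1}{\vartheta_\infty},\ \frac{1}{\varphi^*_{\min}}\Big),$$

the frequency vector $f^*$ is s.t.

$$f^* \in A'_n(\delta, \hat\alpha\vartheta) \quad\text{and}\quad \frac{\#f^*}{\#\{f \in F_n \cap C(\delta),\ \|f-\varphi^*\|_1 > \vartheta\}} \ge \frac{1}{\varepsilon}.$$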

We can paraphrase this as (recall EC'' in the Introduction)

The MaxEnt frequency vector $f^*$, which is no farther than $\hat\alpha\vartheta$ in $\ell_1$ norm from
$\varphi^*$, has $1/\varepsilon$ times as many realizations as the entire set of vectors that satisfy
the constraints to the prescribed accuracies but differ from $\varphi^*$ by more than $\vartheta$
in $\ell_1$ norm:

[Figure: the set $F_n \cap C(\delta)$.]




As far as the number of excluded vectors goes, i.e. the size of the set $A'_n(\delta,\vartheta)$, we
have Lemma 3.5. This points out that even though we can make the tolerance $\vartheta$ as small
as desired, the number of excluded vectors around $f^*$ in Lemma 3.6 does not necessarily
become small.

Finally, we note that the result of Theorem 3.3 might be improved by tightening the
bounds of Proposition 3.3 in the ways indicated in its proof.

3.4 Maximum entropy probability distributions

The maximum entropy method, MaxEnt, is most commonly presented as the solution
to the problem of inferring a unique probability distribution from limited information
(constraints). A very appealing construction of such distributions in the discrete case is
the "Wallis derivation", given by Jaynes in [Jay03], §11.4. The main idea is that the $n$ units
to be allocated to the $m$ boxes are thought of as probability quanta, each of size $1/n$. These
quanta are used to construct rational approximations to an $m$-vector with real entries.
When $n$ is large, the result is the most likely (greatest number of realizations) probability
distribution that satisfies the constraints, which we have denoted $\varphi^*$. The norm formulation
of entropy concentration in §3.3 lends itself perfectly to obtaining non-asymptotic bounds
in this situation.

To emphasize that here we are viewing vectors in $F_n$ as discrete probability
distributions, we use $p$ in place of $f$ and $P_n$ in place of $F_n$. Thus Lemma 3.6 becomes

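Corollary 3.1 Given any $\delta, \varepsilon > 0$ and $0 < \vartheta < 1/2$, and the MaxEnt vector $\varphi^*$, if
$m < \tfrac12\vartheta^3 e^{1/\vartheta}$ and

$$n \ge \frac{3}{4}\,\mu^* \max\Big(\frac{1}{\hat\alpha\vartheta},\ \frac{1}{\vartheta_\infty},\ \frac{1}{\varphi^*_{\min}}\Big),$$

where $\hat\alpha$ is as in Lemma 3.6, the discrete p.d. $p^*$ with rational elements obtained from $\varphi^*$
as specified in Definition 2.1 is s.t.

$$p^* \in A'_n(\delta, \hat\alpha\vartheta) \quad\text{and}\quad \frac{\#p^*}{\#\{p \in P_n \cap C(\delta),\ \|p-\varphi^*\|_1 > \vartheta\}} \ge \frac{1}{\varepsilon}.$$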

Corollary 3.1 increases the applicability of our concentration results significantly, as
the principal use of the MaxEnt method is to infer probability distributions. The lower
bound on $n$ is simple: the first term depends on the desired tolerance $\vartheta$ and concentration
factor $\varepsilon$, the second just on the desired accuracies $\delta$, and the third on the maximum entropy
solution $\varphi^*$. We illustrate this result in §4.3 using the probability distribution of the length
of a queue.
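The three-term lower bound can be evaluated mechanically once $\hat\alpha$ is known. The sketch below is only an illustration of that arithmetic: it takes $\hat\alpha$ as given (solving $N(\alpha) = 3\mu^*/(4\vartheta_0(\alpha))$ is not attempted), the function and argument names are ours, and the inputs in the usage lines are hypothetical.

```python
import math

def corollary_n_bound(mu_star, alpha_hat, theta, theta_inf, phi_min):
    """Evaluate (3/4) * mu* * max(1/(alpha*theta), 1/theta_inf, 1/phi_min).

    Returns the bound together with the name of the dominant term.
    Pass theta_inf = math.inf when there are no constraints.
    """
    terms = {
        "tolerance": 1.0 / (alpha_hat * theta),
        "constraint accuracy": 1.0 / theta_inf,
        "smallest MaxEnt element": 1.0 / phi_min,
    }
    dominant = max(terms, key=terms.get)
    return 0.75 * mu_star * terms[dominant], dominant

if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show which term dominates.
    bound, which = corollary_n_bound(mu_star=25, alpha_hat=0.5, theta=0.05,
                                     theta_inf=1e-4, phi_min=0.0188)
    print(math.ceil(bound), which)
```

Reporting the dominant term is a design choice of this sketch; it shows at a glance whether the tolerance, the constraint accuracy, or the smallest element of $\varphi^*$ is driving the required $n$.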

We mentioned Laplace's principle of indifference in the Introduction. Corollary 3.1
allows us to derive a quantified justification of this principle from entropy concentration:

Corollary 3.2 Given any $\varepsilon > 0$ and $0 < \vartheta \le \min(0.09, 1/m)$, let $\hat\alpha \in (0, 1)$ be the solution
of the equation

$$N(\alpha) = \frac{3m}{4\alpha\vartheta},$$

where $N(\alpha)$ is defined in Theorem 3.3. Let $u^* \in P_n$ be the rational p.d. obtained from
the uniform MaxEnt vector $(1/m, \ldots, 1/m)$ according to Definition 2.1. Then for any
$n \ge N(\hat\alpha)$, $u^*$ is s.t.

$$\frac{\#u^*}{\#\{p \in P_n,\ \|p - (1/m, \ldots, 1/m)\|_1 > \vartheta\}} \ge \frac{1}{\varepsilon} \quad\text{and}\quad \|u^* - (1/m, \ldots, 1/m)\|_1 \le \hat\alpha\vartheta.$$

Footnote: The application to probability distributions invites comparison with the concentration of measure results
in [DP09].

Consider constructing a discrete $m$-element p.d. from quanta of size $1/n$ in the absence
of any constraints at all on the frequencies, that is in the situation of complete ignorance,
apart from the fact that there are $m$ mutually-exclusive possibilities. Corollary 3.2 says
that if $n \ge N(\hat\alpha)$, there is a dominant p.d. $u^* \in P_n$ which has $1/\varepsilon$ times more realizations
than the entire set of p.d.'s in $P_n$ which differ from $(1/m, \ldots, 1/m)$ by more than $\vartheta$ in $\ell_1$
norm. Further, this dominant p.d. is uniform to within $\hat\alpha\vartheta$ in $\ell_1$ norm.

For example, with n ^ 1232818, the p.d. u* has at least 10^ times as as many real- 
izations as the entire set of p.d.'s that differ from (0.5,0.5) by more than 0.01 in ii norm; 
further, u* is no farther than 1.22 • 10~^ in £i norm from (0.5,0.5). 
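For $m = 2$ the quantity bounded in Corollary 3.2 can also be computed directly, which gives a feel for how conservative an explicit bound is. The sketch below works in logarithms (via `lgamma`) so that large $n$ does not overflow; the function names are ours, and the code computes the actual ratio rather than the corollary's bound.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log10_dominance(n, theta):
    """log10 of C(n, n//2) / sum of C(n, k) over k with |2k/n - 1| > theta.

    For m = 2 a p.d. in P_n is (k/n, 1 - k/n), and its l1 distance from
    (1/2, 1/2) is |2k/n - 1|.  The numerator counts the realizations of the
    most balanced p.d., the denominator those of all 'far' p.d.'s.
    """
    center = log_binom(n, n // 2)
    far = [log_binom(n, k) for k in range(n + 1) if abs(2 * k / n - 1) > theta]
    if not far:
        return math.inf
    top = max(far)
    log_far = top + math.log(sum(math.exp(x - top) for x in far))  # log-sum-exp
    return (center - log_far) / math.log(10)

if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        print(n, log10_dominance(n, theta=0.01))
```

The same call with the $n$ of the example above takes a second or two, since it loops over all $n+1$ count vectors.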

4 Examples 

We begin with a simple example of die tosses, and then give an example involving a traffic 
problem and an example having to do with the probability distribution of the size of a 
queue. In the first two examples the units (balls) out of which we construct the composite 
object are clearly distinguishable, but we do not make any use of their distinguishing 
characteristics. In the last example, one would be hard pressed to say that the units can 
be distinguished in any way. 

The examples follow this recipe: 

1. Formulate the problem with constraints on frequencies, treating them as continuous 
(real numbers). 

2. Solve to find the MaxEnt vector $\varphi^* \in \mathbb{R}^m$ and its entropy $H^*$.

3. Define tolerances $\delta$, linking the continuous problem to the discrete allocation problem.

4. Choose $\varepsilon$ and $\eta$ or $\vartheta$ to calculate $N$.

5. For any $n \ge N$, construct the integral count vector $\nu^*$ by rounding and adjusting
$n\varphi^*$, and the rational frequency vector $f^*$ as $\nu^*/n$. If we are talking about discrete
p.d.'s $p$, as in §3.4, interpret $f^*$ as $p^*$.
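Step 5 of the recipe can be sketched as follows. This is one plausible rounding-and-adjusting rule, written to respect the bounds $\|\nu^* - n\varphi^*\|_\infty \le 1$ and $\|\nu^* - n\varphi^*\|_1 \le 3\mu^*/4$ used in the proofs; it is not claimed to coincide with Definition 2.1, and the function name is ours.

```python
import math

def round_and_adjust(phi, n):
    """Turn a real frequency vector phi (summing to 1) into counts summing to n."""
    target = [n * p for p in phi]
    counts = [math.floor(t + 0.5) for t in target]   # round to nearest integer
    deficit = n - sum(counts)                        # units still to add (+) or remove (-)
    if deficit == 0:
        return counts
    residual = [t - c for t, c in zip(target, counts)]
    # To add units, pick the entries rounded down the most; to remove units,
    # pick the entries rounded up the most.  Each adjusted entry then ends up
    # within 1 of its target.
    order = sorted(range(len(phi)), key=lambda i: residual[i], reverse=deficit > 0)
    step = 1 if deficit > 0 else -1
    for i in order[:abs(deficit)]:
        counts[i] += step
    return counts

if __name__ == "__main__":
    phi = [0.0543, 0.0788, 0.1142, 0.1654, 0.2398, 0.3475]  # die example of Section 4.1
    nu = round_and_adjust(phi, 1000)
    print(nu, [c / 1000 for c in nu])
```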

4.1 Die tosses 

We use E.T. Jaynes's classic example of tossing a die (see [Jay82] or [Jay83], Ch. 11)
to illustrate the parameters $\delta$, $\varepsilon$, and $\eta$ appearing in Theorem 3.1, to compare with the
results of [Jay82] which use the asymptotic chi-squared approximation (Theorem 3.2), and
to relate entropy concentration to Sanov's theorem in information theory.

Jaynes considers 1000 tosses of a die in two situations: first, no other information at
all is known (including fairness or biasedness of the die), and second, it is also known that 
the average of the 1000 tosses is 4.5. What can be said in each case about the number of 
times that each face occurred? 

4.1.1 Entropy concentration 

The die tosses can be thought of as assignments of $n = 1000$ balls to $m = 6$ boxes. In the
first case the MaxEnt solution is $\varphi^* = (1/6, \ldots, 1/6)$ with $H^* = \ln 6 = 1.7918$. No element
of $\varphi^*$ is 0, so we have $\mu^* = m = 6$. For this example Jaynes uses two values of $\varepsilon$, 0.05 and
0.005. From Theorem 3.2 these imply entropy deviations $2000\,\Delta H = \chi^2_5(0.05) = 11.07$
and $2000\,\Delta H = \chi^2_5(0.005) = 16.75$, which translate to $\eta = 0.00309$ and $\eta = 0.00467$ in our
formulation. Part (a) of Table 4.1 lists the $N$ and $\hat\alpha$ of Theorem 3.1 for $\eta = 0.00309$ and
various $\varepsilon$, starting from 0.05. Part (b) does the same for $\eta = 0.00467$.
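The translation from a chi-squared quantile to the relative tolerance $\eta$ is a one-line computation, $\eta = \Delta H / H^*$ with $2n\,\Delta H = \chi^2$. A minimal sketch, using the quantile values quoted above as inputs (the function name is ours):

```python
import math

def eta_from_chi2(chi2_quantile, n, h_star):
    """Relative entropy tolerance eta implied by 2 n dH = chi2_quantile."""
    delta_h = chi2_quantile / (2 * n)
    return delta_h / h_star

if __name__ == "__main__":
    h_star = math.log(6)          # uniform MaxEnt solution for a die
    for chi2 in (11.07, 16.75):   # quantiles quoted in the text
        print(chi2, round(eta_from_chi2(chi2, 1000, h_star), 5))
```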



(a)
  $\eta$        $\varepsilon$             $\hat\alpha$      $N$
  0.00309    0.05              0.340    16071
             0.005             0.326    16858
             $5\cdot10^{-6}$   0.295    18866
             $5\cdot10^{-12}$  0.255    22223
             $5\cdot10^{-18}$  0.227    25261
             $5\cdot10^{-36}$  0.175    33739

(b)
  $\eta$        $\varepsilon$             $\hat\alpha$      $N$
  0.00467    0.05              0.340    10065
             0.005             0.326    10585
             $5\cdot10^{-6}$   0.293    11913
             $5\cdot10^{-12}$  0.252    14132
             $5\cdot10^{-18}$  0.224    16141
             $5\cdot10^{-36}$  0.172    21747

(c)
  $\eta$        $\varepsilon$             $\hat\alpha$      $N$
  0.0067     0.01              0.330    6945
             0.0001            0.304    7597
             $10^{-8}$         0.270    8704
             $10^{-16}$        0.226    10633
             $10^{-32}$        0.176    14144
             $10^{-64}$        0.125    20771

Table 4.1: The $N$ of Theorem 3.1 for Jaynes' die tosses. (a): no other information and
$\eta = 0.00309$, (b): no other information and $\eta = 0.00467$, (c): mean of 4.5 and $\eta = 0.0067$,
$\delta^{=} = 0.00467$.

In the second case, when the mean of the 1000 tosses is also known, the information is
expressed by $A^{=} = (1,2,3,4,5,6)$, $b^{=} = (4.5)$, and we have $\varphi^* = (0.0543, 0.0788, 0.1142, 0.1654,
0.2398, 0.3475)$ with $H^* = 1.61358$. Now $|||A^{=}|||_\infty = 21$, $|b^{=}|_{\min} = 4.5$, and by Proposition
2.1 the achievable tolerance $\delta^{=}$ is 0.00467. Here Jaynes takes $\varepsilon = 0.0001$, leading to
$2000\,\Delta H = 0.012$ and $\eta = 0.0067$. Part (c) of Table 4.1 lists $N$ under these conditions.
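The MaxEnt vector for a single mean constraint has the exponential form $\varphi_i \propto e^{\lambda i}$, with $\lambda$ fixed by the mean. The sketch below solves for $\lambda$ by bisection; it is a generic textbook computation offered as an illustration, not the procedure used by the authors, and the function names are ours.

```python
import math

def maxent_with_mean(values, mean, tol=1e-12):
    """Maximum entropy p.d. on `values` with prescribed mean.

    The solution has the form p_i proportional to exp(lam * values[i]);
    lam is found by bisection, using that the mean is increasing in lam.
    """
    def tilted(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return [x / z for x in w]

    def mean_of(lam):
        return sum(p * v for p, v in zip(tilted(lam), values))

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_of(mid) < mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

if __name__ == "__main__":
    phi = maxent_with_mean([1, 2, 3, 4, 5, 6], 4.5)
    print([round(x, 4) for x in phi], round(entropy(phi), 5))
```

With mean 3.5 the same function returns the uniform vector of the first case.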

To interpret the third row of Table 4.1(c), for example, keep in mind that there are
$6^{8704} \approx 10^{6773}$ possible sequences of 8704 tosses, and $\binom{8709}{5} \approx 4.2 \cdot 10^{17}$ possible
frequency/count vectors, of which about 1 in 44000 has average equal to 4.5 (see §4.1.3).
Then this row of the table says that

Out of all the possible sequences of 8704 or more tosses whose frequency vectors
satisfy the equality constraint (mean) to relative accuracy 0.00467, at most one
in $10^8$ has frequencies/counts with entropy less than 99.33% of the maximum.
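The two counts quoted in this interpretation follow directly from $m^n$ and $|F_n| = \binom{n+m-1}{m-1}$. A short sketch (function names are ours):

```python
from math import comb, log10

def log10_sequences(n, m):
    """log10 of the number m**n of sequences of n outcomes."""
    return n * log10(m)

def num_frequency_vectors(n, m):
    """|F_n|: number of count vectors with m non-negative entries summing to n."""
    return comb(n + m - 1, m - 1)

if __name__ == "__main__":
    print(log10_sequences(8704, 6))
    print(f"{num_frequency_vectors(8704, 6):.3e}")
```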

We see from Table 4.1 that

• The asymptotic $\chi^2$ result and our exact bound are quite far apart: for all of the bold
entries in the table, the $\chi^2$ result is 1000. Only a small part of this difference can be
attributed to our ignoring equality constraints in the dimension of $F_n$, recall §3.2.3.

• $N$ is very insensitive to $\varepsilon$: in the whole table, $\varepsilon$ decreases by 30 orders of magnitude
before $N$ so much as doubles.

• The effect of optimizing the parameter $\alpha$ appearing in Theorem 3.1 can be significant,
as illustrated in Fig. 4.1.




Figure 4.1: The quantities $1/(2\vartheta_0(\alpha))$ and $N(\alpha)$ of Theorem 3.1 vs. $\alpha$ in the case $N = 7597$
of Table 4.1(c). (Horizontal axis: $\alpha$ from 0.1 to 0.9.)



4.1.2 Sanov bound 



Sanov's theorem ([CT91], Theorem 12.4.1) bounds the probability of a set of $n$-sequences
in terms of the distribution with minimum cross-entropy (relative entropy) in this set,
assuming that the p.d. generating the sequences is known. The theorem involves sets of
sequences and maximum entropy, so it is useful to understand how it relates to the entropy
concentration results. The Sanov theorem is usually expressed in the terminology of the
"theory of types", see Table 4.2. Stated in our terminology,

Theorem. (Sanov, equiprobable version) If $C$ is a subset of $\mathbb{R}^m$ and all $n$-sequences of $m$
symbols are equiprobable, then

$$\Pr(C \cap F_n) \le \frac{1}{m^n}\binom{n+m-1}{m-1}\, e^{nH^*},$$

where $H^*$ is the entropy of the maximum entropy distribution in $C$.

[The proof of this version of the theorem is simple: using Table 4.2 to translate probability
to #, what is to be proved reduces to $\#(C \cap F_n) \le \binom{n+m-1}{m-1} e^{nH^*}$. But $\binom{n+m-1}{m-1} = |F_n|$ is
an upper bound on $|C \cap F_n|$, and then we use the bound of (3.4).]

Footnote: The perhaps more familiar Chernoff bound follows from Sanov's theorem. See e.g. [DP09] where the
Chernoff bound is expressed in terms of relative entropy.



  Theory of types                                        Our terms
  type                                                   frequency vector
  type class                                             set of realizations of a type
  size of type class $C$                                 $\#C$
  probability of class $C$ under uniform p.d.            $\#C / m^n$

Table 4.2: Information theory (theory of types) terms in [CT91] and [Csi99] on the left,
and our terms on the right.

First we give a numerical example. Take the set $C$ defined by $f_i \ge 0$, $\sum_i f_i = 1$, $\sum_{i=1}^{6} i f_i = 4.5$,
which we considered in §4.1.1. Then $H^* = 1.61358$, so with $n = 9542$ the theorem gives
7.88 • 10~^^^ as an upper bound for the probability of C n -F9542- Therefore, by the last row 
of Tableg^l #(C n F9542) ^ 1.04 • lO^^^^^ This is a big number, but e^^^^ p. iq7425 jg ^^^^i 
bigger still, leading to the small probability. 

Recalling §3.1 we see that the Sanov result is an upper bound on the number of
realizations of the sequences in the set $A_n$ defined in (3.2). Lemma 3.2 is a lower bound on
this number, and Theorem 3.1 is a lower bound on the ratio of this number to the
complementary number $\#B_n$. (See (A.10) in the proof of the theorem in the Appendix.) In the
Sanov bound, the set $A_n$ is interpreted as a set of "bad" or undesirable sequences whose
probability we want to limit. On the contrary, in the entropy concentration results, $A_n$ is
viewed as the "good" set of interest, whose dominance we want to demonstrate, whereas
$B_n$ is the undesirable set. The concentration results then show not only that the good set
$A_n$ has a lot of realizations (Lemma 3.2), but that in fact its realizations dominate those
of the bad set $B_n$. In other words, the concentration result answers the question

If we adopt the set $A_n$, or even the vector $f^*$ itself, as our prediction or estimate
in the face of the limited information, how reliable is this prediction? What
about these other possible frequency vectors that also accord with the given
information?
4.1.3 The exact number of frequency vectors satisfying the constraints

The $n$ tosses of the die with average known to equal 4.5 are characterized by a count vector
$\nu = (\nu_1, \ldots, \nu_6)$ s.t. $\nu_i \ge 0$, $\nu_1 + \cdots + \nu_6 = n$, and $2(1\nu_1 + 2\nu_2 + \cdots + 6\nu_6) = 9n$. These
constraints define a polytope $C$ in $\mathbb{N}^6$ which depends on $n$. Using the theory and algorithms
in [VWBC05] for counting lattice points in parametric polytopes, it is possible to compute
the exact number of lattice points in this polytope, i.e. the number of vectors $\nu \in \mathbb{N}^6$
satisfying the constraints (exactly), as a function of $n$. The result is a long expression,
polynomial in floors of sub-expressions linear in $n$; a much simpler slight approximation
(see [BV08] for the details) is the ordinary polynomial

$$|C \cap F_n| \approx \frac{19n^4}{11520} + \frac{n^3}{32} + \frac{113n^2}{576} + \frac{2101723\,n}{4196000} + \frac{225740219}{755280000}.$$

There is some distance between the easy upper bound $|C \cap F_n| \le |F_n| = \binom{n+5}{5}$ and this exact
result: for $n = 1000$ the bound is $8.46 \cdot 10^{12}$, whereas the exact result is $1.680752 \cdot 10^{9}$, quite
a bit smaller. (But if we reduce $m$ to 4, reflecting the correct dimension of $F_n$, recall §3.2.3,
the bound improves to $4.2 \cdot 10^{10}$.) For $n = 8704$, the above exact result is $9.48684 \cdot 10^{12}$
and the bound is $4.2 \cdot 10^{17}$.
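For this particular polytope the exact count can also be obtained with elementary means, without the parametric lattice-point machinery. The sketch below uses a different technique from the one cited in the text: count vectors with $\sum\nu_i = n$ and $\sum i\nu_i = 4.5n$ correspond to partitions of $3.5n$ into at most $n$ parts of size at most 5, which are counted by a coefficient of the Gaussian binomial $\binom{n+5}{5}_q$. The function names are ours.

```python
def count_die_vectors(n, faces=6, twice_mean=9):
    """Number of count vectors (nu_1..nu_faces) >= 0 with sum n and mean twice_mean/2.

    Equals the coefficient of q^t, t = n*twice_mean/2 - n, in the Gaussian
    binomial  prod_{i=1}^{k} (1 - q^(n+i)) / (1 - q^i),  k = faces - 1.
    """
    if (twice_mean * n) % 2:
        return 0                      # the mean cannot be met exactly
    t = twice_mean * n // 2 - n       # subtract 1 from each of the n tosses
    k = faces - 1
    if t < 0 or t > k * n:
        return 0
    coeffs = [0] * (t + 1)            # power series truncated at degree t
    coeffs[0] = 1
    for i in range(1, k + 1):
        shift = n + i
        for d in range(t, shift - 1, -1):   # multiply by (1 - q^shift)
            coeffs[d] -= coeffs[d - shift]
        for d in range(i, t + 1):           # divide by (1 - q^i)
            coeffs[d] += coeffs[d - i]
    return coeffs[t]

def polynomial_approximation(n):
    """The approximating polynomial quoted in the text."""
    return (19 * n**4 / 11520 + n**3 / 32 + 113 * n**2 / 576
            + 2101723 * n / 4196000 + 225740219 / 755280000)

if __name__ == "__main__":
    for n in (2, 4, 100, 1000):
        print(n, count_die_vectors(n), polynomial_approximation(n))
```

Python integers are unbounded, so the count is exact for any $n$; the cost is linear in $n$.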

4.2 Vehicle or network traffic 

Figure 4.2: Five cities connected by two highways.

Five cities are connected by two highways as shown in Fig. 4.2. The total number of cars
in the 5 cities is known. From measurements made on one day, the number of cars that left
cities 1, 2, and 4 is known. The number of cars that travelled the highway segment from 2
to 3 is also known, and finally it is known that at least a certain number travelled the 5 to
3 segment (this segment was observed for only part of the day). Given this information,
what is the most likely number (fraction) of cars that travelled between each pair of cities
on that day? This is the $5 \times 5$ matrix $C = [c_{ij}]$; $c_{ii}$ represents the fraction of cars that
left city $i$ and returned to it. (Clearly, instead of cars, we could be considering packets or
other units of traffic in a communications network.) Our information is
^ dj = ri, i = 1,2,4, Ci3 +Ci4 + C23 +C24 = S23, £13 + 053+014^553, 
where the $r_i$ and $s_{ij}$ are also fractions of the total. Suppose that $(r_1, r_2, r_4) = (0.13, 0.25, 0.1)$
and $(s_{23}, s_{53}) = (0.11, 0.07)$. Then the MaxEnt vector $\varphi^* = (c_{11}, \ldots, c_{15}, c_{21}, \ldots, c_{25}, \ldots)$
arranged in matrix form is

$$\begin{pmatrix}
0.030790 & 0.030790 & 0.018816 & 0.018816 & 0.030790 \\
0.059210 & 0.059210 & 0.036184 & 0.036184 & 0.059210 \\
0.052 & 0.052 & 0.052 & 0.052 & 0.052 \\
0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
0.052 & 0.052 & 0.052 & 0.052 & 0.052
\end{pmatrix}$$

with $H^* = 3.1419$. We also have



5, IP' 



0.1, 



0.07. 



None of the elements of $\varphi^*$ is 0, so $\mu^* = m = 25$. Table 4.3 lists some values of $N$ obtained
from Theorems 3.1, 3.3 and Lemma 3.6, assuming the tolerances $\delta = (0.005, 0.01, \infty, \infty)$
for satisfying the constraints.
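The stated MaxEnt matrix can be checked against the equality information and the quoted entropy. The sketch below only verifies the numbers printed above (rounded to six decimals); it does not solve the MaxEnt problem, and it leaves out the inequality on the 5-to-3 segment. Names are ours.

```python
import math

# MaxEnt traffic matrix as printed in the text (row i = origin city i).
C = [
    [0.030790, 0.030790, 0.018816, 0.018816, 0.030790],
    [0.059210, 0.059210, 0.036184, 0.036184, 0.059210],
    [0.052] * 5,
    [0.02] * 5,
    [0.052] * 5,
]

def row_sum(matrix, i):
    """Fraction of cars that left city i (1-based)."""
    return sum(matrix[i - 1])

def segment_2_to_3(matrix):
    """c13 + c14 + c23 + c24: traffic over the highway segment from 2 to 3."""
    return sum(matrix[i][j] for i in (0, 1) for j in (2, 3))

def entropy(matrix):
    return -sum(x * math.log(x) for row in matrix for x in row if x > 0)

if __name__ == "__main__":
    print([round(row_sum(C, i), 4) for i in (1, 2, 4)])   # r1, r2, r4
    print(round(segment_2_to_3(C), 4))                    # s23
    print(round(entropy(C), 4))                           # H*
```

Rows 3 and 5 carry no information and come out uniform, while the segment constraint lowers columns 3 and 4 of rows 1 and 2 relative to the other entries of those rows.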



Left:
  $\eta$      $\varepsilon$         $\hat\alpha$       $N$ (Th. 3.1)
  0.01     $10^{-6}$     0.9163    120000
           $10^{-12}$    0.9126
           $10^{-24}$    0.8995
  0.005                  0.8326
                         0.8257
                         0.7991
  0.001                  0.3567    160807
                         0.3471    165692
                         0.3171    183001

Right:
  $\vartheta$      $\varepsilon$         $\hat\alpha$       $N$ (Th. 3.3)        $N$ (L. 3.6)
  0.05     $10^{-6}$     0.9833    416022               489889
           $10^{-12}$    0.9825    428261               502121
           $10^{-24}$    0.9799    471616               545477
  0.01                   0.9163    $1.35\cdot10^{7}$    $1.58\cdot10^{7}$
                         0.9126    $1.38\cdot10^{7}$    $1.62\cdot10^{7}$
                         0.8995    $1.49\cdot10^{7}$    $1.72\cdot10^{7}$
  0.005                  0.8326    $5.96\cdot10^{7}$    $6.96\cdot10^{7}$
                         0.8253    $6.08\cdot10^{7}$    $7.08\cdot10^{7}$
                         0.7991    $6.51\cdot10^{7}$    $7.52\cdot10^{7}$

Table 4.3: Traffic example with $\delta = (0.005, 0.01, \infty, \infty)$. Empty entries indicate repetition
of previous values. Left: the $N(\delta, \varepsilon, \eta)$ of Theorem 3.1; 120000 is the value of $(m-1)/(2\vartheta_\infty)$.
Right: the $N(\delta, \varepsilon, \vartheta)$ of Theorem 3.3 and Lemma 3.6.

In this problem it makes much more sense to think of the frequency or count vectors
as (traffic) matrices. Consider the 3d row of Table 4.3 on the right. With $n = 545500$, by
Definition 2.1 we get the count matrix

$$\begin{pmatrix}
16796 & 16796 & 10265 & 10264 & 16796 \\
32299 & 32299 & 19738 & 19738 & 32299 \\
28366 & 28366 & 28366 & 28366 & 28366 \\
10910 & 10910 & 10910 & 10910 & 10910 \\
28366 & 28366 & 28366 & 28366 & 28366
\end{pmatrix}$$

How are we to interpret this? First, the number of all possible 5x5 count matrices with 



total sum 545500 is \F; 



545500 1 



545500+24 

24 



) = 7.77 • 10113. Second, 1.171 • lO^o^ of these 
matrices satisfy the constraints. (To find this number we express $(r_1, r_2, r_4)$, $(s_{23}, s_{53})$, and
$\delta$ as rationals, resulting in inequalities such as

$(13/100 - 1/2000)\,n \le \nu_{11} + \nu_{12} + \nu_{13} + \nu_{14} + \nu_{15} \le (13/100 + 1/2000)\,n$
$(25/100 - 1/2000)\,n \le \nu_{21} + \nu_{22} + \nu_{23} + \nu_{24} + \nu_{25} \le (25/100 + 1/2000)\,n$
(7/100 - 7/10000)n < u^ + 1^53 + i/u, 



etc., and proceed as in §4.1.3 to find the number of lattice points in the polytope.) So about
1 out of every 10^*^ of the possible traffic matrices satisfies the constraints. Third, each of 
these matrices can be realized in many ways (multinomial coefficient). Now the entry for
$N$ (Th. 3.3) in the third row of Table 4.3 on the right says:

Consider all the assignments of the 545500 cars to the 25 matrix elements that 
result in one of the 1.171 • 10^*^^ matrices that satisfy the constraints to accuracy 
$\delta$. Only one in $10^{24}$ of these assignments results in a matrix that deviates from
$\nu^*$ by more than 27300 in $\ell_1$ norm.

And the entry for $N$ (L. 3.6) says something more impressive:

The MaxEnt matrix $\nu^*$ can be realized in $10^{24}$ times as many ways as the entire
set of 1.171 • 10^"^ matrices that satisfy the constraints but differ from z^* by 
more than 27300 in $\ell_1$ norm. $\nu^*$ differs from $n\varphi^*$ by no more than 43.75 in $\ell_1$
norm.



4.3 Queue length distribution 

Suppose we have a single-server G/G/1 queue of finite capacity $c \in \mathbb{N}$, in which customers
arrive at rate $\lambda$ and experience a mean waiting time $W$. The known or measured $\lambda$ and
$W$ imply (Little's law) a mean queue length $L = \lambda W \in \mathbb{R}$. So we consider the system
under the following two states of knowledge: (a) besides $c$, only the mean queue length
$L$ is known, (b) in addition, we know that the probability that the queue is empty is in
the interval $[a_1, b_1]$, and the probability that it is full is in $[a_2, b_2]$. What can be said in
these two scenarios about the distribution $p_0, p_1, \ldots, p_c$ of the queue's length? (This
is a simple example; much more complex MaxEnt queueing problems are addressed in
[Kou94], [BGdMT06].)

Using what was said in §3.4, here we have the problem of inferring a unique discrete
p.d. $(p_0, p_1, \ldots, p_c) \in \mathbb{R}^{c+1}$ from the information

$1p_1 + 2p_2 + \cdots + c\,p_c = L$, (4.1)

in the first case, and from

$1p_1 + 2p_2 + \cdots + c\,p_c = L$, $\quad p_0 \in [a_1, b_1]$, $\quad p_c \in [a_2, b_2]$ (4.2)



in the second case. We interpret information (4.1) as assigning $n$ probability quanta of size
$1/n$ to $m = c+1$ boxes under the constraint that the "mean box index" must equal $L$, and
information (4.2) as imposing the additional constraints that the fraction assigned to box 0
must be between $a_1$ and $b_1$, while that assigned to box $c$ must be between $a_2$ and $b_2$. If
we take $c = 12$, $L = 5.63$, the MaxEnt p.d. $\varphi^*$ in the first case is, as expected, geometric:

$\varphi^*_k = 0.0897 \cdot 0.9739^k$ with $H^* = 2.5600$.

Footnote (to §4.2): This follows from the fact that $\|f - \varphi^*\|_1 \le \vartheta \Rightarrow \|\nu - \nu^*\|_1 \le n\vartheta + m$.
Footnote (to §4.2): In this case we have $\hat\alpha = 6.87 \cdot 10^{-4}$.



Now we add information to the effect that the probability of being empty is larger than
expected while that of being full is smaller than expected, expressed by $p_0 \in [0.12, 0.14]$,
$p_{12} \in [0.01, 0.04]$. We get a distribution of lower entropy, geometric between 1 and 11:



ip*o = 0.12, ipl = 0.0768 ■0.987'', 93^2 = 0-04 with iJ* = 2.5432. 



(4.3) 



\A^ 



u*/n, 



m case 

lJ6^ln- 



(filD we have |||^~|||oo = 
in = 0.01. If we choose the 



To investigate the concentration around p* 

78, |6^|min = 5.63, and in case M.2p we add 

tolerances 6^ = 10"^, 6^ = 10~ , Table [44l lists some values of A^ obtained from Theorem 

13.31 and from Corollarv l3.1l for e = 10"^'^. The results are the same for both scenarios even 

though the ip* with bounded pq and pi2 has lower entropy, for two reasons: first, because 

H* does not appear in Theorem 13.31 or Corollary I3.lt second, because t?oo (Proposition 

12. 2p is the same in both cases. To interpret the first line of Table 14.41 suppose we choose 



is= 


S^) 


^ e i9oo 


^TrfT3 


a 


^^Ccl3.1| 


IQ-^ 


10"^ 


0.01 10" 


-'^^ 7.22 • 10-^ 


8.31 • lO'' 


1.76 


10-^ 


1.35-10^ 






0.001 




1.00 • 10^ 


8.37 


lo-*' 


1.17- 10^ 






10"^ 




1.23 • 10^^ 


6.84 


10-^ 


1.42 • 10" 






10-5 




1.45 • 10^3 


5.70 


10-8 


1.68-10^3 






10-6 




1.67-1015 


5.02 


10-9 


1.94-1015 



Table 4.4: Denominator N of the rational approximation p* E P^r to the MaxEnt p.d. 



cp G 



1)13 



from Theorem 13.31 and from Corollary 13.11 Results are the same whether only 



the mean is known, or bounds on pQ and pi2 are also known. Recall that t?c 
on 6. Empty entries signify no change from above. 



depends only 



n = 13500000. We then find 
1 



P 



(1620000,964722,977442,990323,1003387,1016618,1030037, , , 

13500000^ , , , , , , : j,^^^^ 

1043613, 1057372, 1071308, 1085429, 1099749, 540000). 



By Corollary [3Tl this rational approximation to the MaxEnt p.d. (|4.3p has at least 10^° 
times more realizations than the entire set of p.d.'s which satisfy the constraints to accuracy
$\delta$ but differ from the p.d. (4.3) by more than 0.01 in $\ell_1$ norm.
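The rational approximation $p^*$ listed above can be checked directly against the information (4.2). The sketch below uses the thirteen counts as printed, with $n = 13500000$, $c = 12$ and $L = 5.63$; it only verifies the arithmetic, and the names are ours.

```python
# Counts nu*_0 .. nu*_12 as printed in the text, n = 13 500 000.
COUNTS = [1620000, 964722, 977442, 990323, 1003387, 1016618, 1030037,
          1043613, 1057372, 1071308, 1085429, 1099749, 540000]
N = 13_500_000
L = 5.63

def mean_index(counts):
    """Mean queue length implied by the counts: sum_k k * nu_k / n."""
    return sum(k * v for k, v in enumerate(counts)) / sum(counts)

def interior_ratios(counts):
    """Successive ratios nu_{k+1}/nu_k for k = 1..10 (the geometric part)."""
    return [counts[k + 1] / counts[k] for k in range(1, 11)]

if __name__ == "__main__":
    print(sum(COUNTS) == N)
    print(COUNTS[0] / N, COUNTS[12] / N)          # p*_0 and p*_12
    print(abs(mean_index(COUNTS) - L) / L)        # relative error of the mean
    print(min(interior_ratios(COUNTS)), max(interior_ratios(COUNTS)))
```

The interior ratios are essentially constant, which is the "geometric between 1 and 11" shape, and the mean constraint is met to a relative error far below $10^{-6}$.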
5 Conclusion

The phenomenon of entropy concentration appears when a large number of units is allo- 
cated to containers subject to constraints that are linear functions of the numbers of units 
in each container: most allocations will result in frequency (normalized count) vectors 
with entropy close to that of the vector of maximum entropy that satisfies the constraints. 
Asymptotic proofs of this phenomenon are known, beginning with the work of E. T. Jaynes, 
but here we presented a formulation entirely devoid of probabilities and provided explicit 
bounds on how large the number of units must be for concentration to any desired degree 
to occur. Our formulation also deals with the fact that constraints cannot be satisfied 
exactly by rational frequencies, but only to some prescribed tolerances, and also eliminates 
(as opposed to "resolves" ) the well-known issue of expectations vs. measurements in con- 
straints. In addition, we established a perhaps more useful version of the concentration 
result, in terms of deviation from the maximum entropy vector, instead of the usual max- 
imum entropy value, as well as results that pertain to the maximum entropy vector itself 
and not to a whole set of vectors around it. Because of its conceptual simplicity and mini- 
mality of assumptions, entropy concentration is a powerful justification of the widely-used 
discrete MaxEnt method (the other being axiomatic formulations), and we believe that 
the explicit, non-asymptotic bounds strengthen it considerably. All of our results were 
illustrated with detailed numerical examples. 
A Proofs

Proof of Proposition 2.1

Rounding ensures that $\|\tilde\nu - n\varphi^*\|_\infty \le 1/2$. From the explanation after Definition 2.1,
the adjustment of $\tilde\nu$ to $\nu^*$ ensures $\|\nu^* - n\varphi^*\|_\infty \le 1$, which establishes the $\ell_\infty$ claim. In
more detail, the zero elements of $\nu^*$ coincide with those of $\varphi^*$, and this adjustment causes
at most $\lfloor\mu^*/2\rfloor$ of the non-zero elements of $\nu^*$ to differ from the corresponding elements
of $n\varphi^*$ by $\le 1$, so $\|\nu^* - n\varphi^*\|_1 \le 1 \cdot \lfloor\mu^*/2\rfloor + (1/2)\cdot(\mu^* - \lfloor\mu^*/2\rfloor) \le 3\mu^*/4$. Hence
$\|\nu^*/n - \varphi^*\|_1 \le (3/4)\mu^*/n$, which establishes the claim for the $\ell_1$ norm.

Proof of Proposition 2.2

Beginning with the equality constraints (2.1), note that $A^{=}\varphi^* = b^{=} \Rightarrow A^{=}(\varphi^* - f) +
A^{=}f = b^{=}$. Set $A^{=}(f - \varphi^*) = \beta$. Then we have $A^{=}f = b^{=} + \beta$, with $\|\beta\|_\infty = \|A^{=}(f - \varphi^*)\|_\infty$.
Now $\|A^{=}(f - \varphi^*)\|_\infty \le |||A^{=}|||_\infty \|f - \varphi^*\|_\infty$, where $|||\cdot|||_\infty$ denotes the matrix infinity norm
(also known as the "maximum row sum" norm). The inequality holds because the vector
norm $\|\cdot\|_\infty$ is compatible with the (rectangular) matrix norm $|||\cdot|||_\infty$ (see [HJ90], §5.7). So
if we make $\|f - \varphi^*\|_\infty \le \delta^{=}|b^{=}|_{\min} / |||A^{=}|||_\infty$, we will have $\|\beta\|_\infty \le \delta^{=}|b^{=}|_{\min}$, as required
by (2.1). The inequality constraints (2.2) are handled in exactly the same way.
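The argument above translates into two small computations: the maximum-row-sum norm, and the relative accuracy $\delta^{=}$ that is guaranteed once $\|f - \varphi^*\|_\infty$ is known. A sketch follows, checked on the die example of §4.1 where $\|f^* - \varphi^*\|_\infty \le 1/n$ with $n = 1000$; the function names are ours.

```python
def matrix_inf_norm(a):
    """Maximum absolute row sum of a (possibly rectangular) matrix."""
    return max(sum(abs(x) for x in row) for row in a)

def guaranteed_delta(a, b, dist_inf):
    """Relative accuracy delta guaranteed when ||f - phi*||_inf <= dist_inf.

    From ||A(f - phi*)||_inf <= |||A|||_inf * ||f - phi*||_inf <= delta * |b|_min.
    """
    b_min = min(abs(x) for x in b)
    return matrix_inf_norm(a) * dist_inf / b_min

def required_distance(a, b, delta):
    """Largest ||f - phi*||_inf that still guarantees relative accuracy delta."""
    b_min = min(abs(x) for x in b)
    return delta * b_min / matrix_inf_norm(a)

if __name__ == "__main__":
    A, b = [[1, 2, 3, 4, 5, 6]], [4.5]      # mean constraint for the die
    print(matrix_inf_norm(A))
    print(round(guaranteed_delta(A, b, 1 / 1000), 5))
```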

Finally, for the equalities with zeros (2.3) we have $A^{=0}f = \zeta$, where $\|\zeta\|_\infty = \|A^{=0}(f -
\varphi^*)\|_\infty$.

Proof of Proposition 2.3

Theorem 16.3.2 of [CT91] is a similar result, but in terms of the $\ell_1$ norm. The function
$f(x) = -x\ln x$, $x \in [0,1]$, is concave and has a maximum at $x = 1/e$. Let $a > 0$ be
$\le \gamma$, and consider the difference of the values of $f(x)$ at two points that are $a$ apart:
$g(x) = f(x+a) - f(x)$. Since $g'(x) \le 0$ always, the maximum of $g(x)$ occurs at $x = 0$
and equals $-a\ln a$. So if $\gamma \le 1/e$,

$$\forall x, \quad \max_{0 \le a \le \gamma} |f(x+a) - f(x)| = \gamma\ln\frac{1}{\gamma}, \qquad \gamma \le \frac{1}{e}.$$

(This is tighter than what we would get by simply applying the defining inequality of
concavity to $f$.) We have now shown that if $|f_i - \varphi^*_i| \le \gamma$, then $|f_i\ln f_i - \varphi^*_i\ln\varphi^*_i| \le
\gamma\ln 1/\gamma$. The result of the proposition follows.

Proof of Proposition 2.4

Using Proposition 2.3 we want to find a $\hat\gamma$ s.t. $m\gamma\ln 1/\gamma \le \eta H^*$ for all $\gamma \le \hat\gamma$. Setting
$y = 1/\gamma$ and $C = m/(\eta H^*)$, we want to find a $\hat y$ s.t. for all $y \ge \hat y$, $y \ge C\ln y$. We claim that
this inequality, where $C \ge 1$ and $y$ is expected to be $\gg 1$, is satisfied by $\hat y = (1+c)\,C\ln C$
for any $c > 0$. Indeed,

$\hat y \ge C\ln\hat y \iff C^{1+c} \ge (1+c)\,C\ln C \iff C^{c}/\ln C \ge 1 + c$, (A.1)

which is possible for any $c > 0$ if $C$ is large enough. With $c = 0.5$, this condition is
$\sqrt{C}/\ln C \ge 3/2$. But this holds for $C \ge 21$, a very mild requirement. Finally, the function
$y/\ln y$ is increasing for $y \ge e$, so the l.h.s. of (A.1) will hold for all $y \ge \hat y$ as desired. We
have now shown that $H^* - H(f) \le \eta H^*$ will hold if $f$ is s.t.

$$\|f - \varphi^*\|_\infty \le \frac{2}{3}\,\frac{\eta H^*}{m\ln(m/\eta H^*)},$$

where the "2/3" can be tightened to $1/(1+c)$.
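The threshold $C \ge 21$ for $c = 0.5$ is easy to confirm numerically. The sketch below checks the inequality $\hat y \ge C\ln\hat y$ at $\hat y = (1+c)\,C\ln C$, and evaluates the resulting tolerance on $\|f - \varphi^*\|_\infty$; the function names are ours.

```python
import math

def y_hat(C, c=0.5):
    """Candidate threshold y_hat = (1 + c) * C * ln C."""
    return (1 + c) * C * math.log(C)

def satisfies(C, c=0.5):
    """True if y_hat >= C * ln(y_hat), i.e. if C**c / ln C >= 1 + c."""
    y = y_hat(C, c)
    return y >= C * math.log(y)

def gamma_hat(m, eta, h_star, c=0.5):
    """Tolerance 1/y_hat on ||f - phi*||_inf, with C = m / (eta * H*)."""
    return 1.0 / y_hat(m / (eta * h_star), c)

if __name__ == "__main__":
    print([C for C in range(2, 30) if not satisfies(C)])   # where the claim fails
    print(gamma_hat(m=6, eta=0.0067, h_star=1.61358))
```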
Proof of Proposition 3.1

Begin with $\#f = \binom{n}{\nu_1, \ldots, \nu_m} = \binom{n}{nf_1, \ldots, nf_m}$ and use the fact that

$$\ln x! = x\ln x - x + \frac{1}{2}\ln x + \ln\sqrt{2\pi} + \frac{\vartheta}{12x}, \qquad \vartheta \in (0, 1),$$ (A.2)

which is defined for all $x > 0$ by $x! = \Gamma(x+1)$. Then we find

$$\ln \#f = nH(f) - (\mu - 1)\ln\sqrt{2\pi n} - \sum_{f_i > 0}\ln\sqrt{f_i} + \frac{\vartheta_0}{12n} - \sum_{f_i > 0}\frac{\vartheta_i}{12 n f_i}.$$

Finally, for the upper bound in Prop. 3.1 the sum of the last two terms is maximized
when $\vartheta_0 = 1$, $\vartheta_i = 0$. For the lower bound, it is minimized when $\vartheta_0 = 0$, $\vartheta_i = 1$.
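The two bounds that result from this argument can be compared with the exact multinomial coefficient. In the sketch below the bounds are written out from (A.2) as we read it (upper: $\vartheta_0 = 1, \vartheta_i = 0$; lower: $\vartheta_0 = 0, \vartheta_i = 1$), so the exact constants should be treated as our reconstruction rather than a quotation of Proposition 3.1. Names are ours.

```python
import math

def log_multinomial(counts):
    """Exact ln #f = ln( n! / (nu_1! ... nu_m!) ), via lgamma."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def stirling_bounds(counts):
    """Lower and upper bounds on ln #f from the Stirling formula (A.2)."""
    n = sum(counts)
    f = [c / n for c in counts if c > 0]      # only non-zero elements matter
    mu = len(f)
    h = -sum(x * math.log(x) for x in f)
    core = (n * h
            - (mu - 1) * math.log(math.sqrt(2 * math.pi * n))
            - sum(math.log(math.sqrt(x)) for x in f))
    lower = core - sum(1 / (12 * n * x) for x in f)
    upper = core + 1 / (12 * n)
    return lower, upper

if __name__ == "__main__":
    counts = [54, 79, 114, 165, 240, 348]     # a count vector for 1000 die tosses
    lo, hi = stirling_bounds(counts)
    print(lo, log_multinomial(counts), hi)
```

The gap between the two bounds is $\frac{1}{12n} + \sum_i \frac{1}{12\nu_i}$, so it shrinks as all counts grow.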

Proof of Lemma 3.1

We begin by observing that the sum over $F_n \cap C(\delta)$ is bounded above by the sum over all
of $F_n$ and then use the bound of Proposition 3.1 on $\#f$ to find

$$\#B_n(\delta,\eta) \le \sum_{f \in F_n,\ H(f) < (1-\eta)H^*} \#f \le e^{n(1-\eta)H^*}\, e^{1/(12n)} \sum_{f \in F_n} S(f, n).$$ (A.3)

To evaluate the last sum, let $F_n^{\mu}$ be the subset of $F_n$ consisting of vectors with $\mu$ non-zero
elements. Since the $F_n^{\mu}$ form a partition of $F_n$,

$$\sum_{f \in F_n} S(f,n) = \sum_{\mu=1}^{m}\binom{m}{\mu} \sum_{\nu_1 + \cdots + \nu_\mu = n,\ \nu_i \ge 1} S(f, n),$$

where the $\binom{m}{\mu}$ comes from the fact that as pointed out in Proposition 3.1, $\#f$ depends
only on the non-zero elements and not on their positions. Thus

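We now need an auxiliary result on the inner sum in (A.4):

Proposition A.1 For any $\mu \ge 2$,

$$\sum_{\substack{\nu_1 + \cdots + \nu_\mu = n \\ \nu_1, \ldots, \nu_\mu \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_\mu}} < \frac{\pi^{\mu/2}}{\Gamma(\mu/2)}\, n^{\mu/2 - 1}.$$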
Proposition A.1 is proved separately later. Using this result in (A.4),

va / \ /9 1 



n 2 , 



For the first inequality we used r(/i/2) ^ 2^1'^"'^ and for the second we assumed that 
$n \ge 100$ and $m \ge 2$. Combining the above with (A.3), and again assuming $n \ge 100$, we
obtain the result of the lemma.

Proof of Proposition 3.2

Ignoring the rational requirement for the moment and denoting $f$ by $x$, only $x_1, \ldots, x_{m-1}$ are independent, so our set is the subset of $\mathbb{R}^{m-1}$ belonging to

$$\begin{array}{ll}
|x_i - \varphi^*_i| \le \vartheta, \quad i = 1, \ldots, m - 1, & (m-1)\text{-dimensional cube,} \\
|(x_1 + \cdots + x_{m-1}) - (\varphi^*_1 + \cdots + \varphi^*_{m-1})| \le \vartheta, & \text{region between two hyperplanes,} \\
x_1 + \cdots + x_{m-1} \le 1, \quad x_i \ge 0, & \text{unit } (m-1)\text{-dimensional simplex.}
\end{array} \tag{A.5}$$

We will construct inside this set an $(m-1)$-dimensional rectangular parallelepiped $P$ whose intersection with $F_n$ is easy to count. To construct $P$ we will determine its two extreme points, the one with the largest coordinates, $\varphi^* + y$, and the one with the smallest, $\varphi^* - z$, where $y_i, z_i \ge 0$.

If $\varphi^* + y$ satisfies (A.5), then

$$y_i \le \vartheta, \qquad y_1 + \cdots + y_{m-1} \le \vartheta, \qquad y_1 + \cdots + y_{m-1} \le \varphi^*_m, \qquad y_i \ge -\varphi^*_i.$$

The 4th inequality is true, and the 2nd implies the first. The 2nd and 3rd inequalities are satisfied if $y_1 + \cdots + y_{m-1} = \min(\vartheta, \varphi^*_m)$, and $y_1 \cdots y_{m-1}$ will be maximized if

$$y_1 = \cdots = y_{m-1} = \frac{1}{m - 1} \min(\vartheta, \varphi^*_m). \tag{A.6}$$

Since $\varphi^*$ has $\mu^* \ge 1$ non-zero elements, we can assume w.l.o.g. that $\varphi^*_m > 0$.

Similarly, for the other extreme point $\varphi^* - z$ we must have $z_1 + \cdots + z_{m-1} \le \vartheta$ and $z_i \le \varphi^*_i$. If some $\varphi^*_i$ are 0, the corresponding $z_i$ are 0, and w.l.o.g. we can take the non-zero $z_i$ to be $z_1, \ldots, z_{\mu^* - 1}$. Then the $z_i$ that satisfy the inequalities and maximize the product $z_1 \cdots z_{\mu^* - 1}$ are

$$z_1 = \cdots = z_{\mu^* - 1} = \frac{\vartheta}{\mu^* - 1}. \tag{A.7}$$

But this needs $\vartheta / (\mu^* - 1) \le \varphi^*_i$ for all the non-zero $\varphi^*_i$, which we have assumed.

Thus from (A.6) and (A.7), $\mu^* - 1$ sides of $P$ have length $y_i + z_i = \frac{1}{m-1} \min(\vartheta, \varphi^*_m) + \frac{\vartheta}{\mu^* - 1}$, and the other $m - \mu^*$ sides have length $y_i + z_i = \frac{1}{m-1} \min(\vartheta, \varphi^*_m)$. Again w.l.o.g. we can take $\varphi^*_m$ to be $\varphi^*_{\max}$, the largest element of $\varphi^*$. So if we assume that $\vartheta \le \varphi^*_{\max}$, $P$ has $\mu^* - 1$ sides of length $\vartheta \left( \frac{1}{m-1} + \frac{1}{\mu^* - 1} \right)$ and $m - \mu^*$ sides of length $\frac{\vartheta}{m-1}$.
Now a $k$-dimensional parallelepiped with sides of lengths $L_1, \ldots, L_k$, irrespective of its location in $\mathbb{R}^k$, contains at least $\lfloor L_1 \rfloor \cdots \lfloor L_k \rfloor$ lattice points, i.e. points in $\mathbb{Z}^k$. (This can be established by induction on $k$. For $k = 1$ it says that a segment of length $L$ on the real axis contains at least $\lfloor L \rfloor$ integers.) Applying this to $P$ with all its $m - 1$ dimensions scaled up by $n$, the scaled $P$ must contain at least

$$\left\lfloor n\vartheta \left( \frac{1}{m - 1} + \frac{1}{\mu^* - 1} \right) \right\rfloor^{\mu^* - 1} \left\lfloor \frac{n\vartheta}{m - 1} \right\rfloor^{m - \mu^*}$$

lattice points, which correspond to points of $P$ whose coordinates are rational numbers with denominator $n$, i.e. vectors in $F_n$. The first factor and its attendant condition $\vartheta \le (\mu^* - 1)\, \varphi^*_{\min}$ are absent if $\mu^* = 1$.
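The lattice-point count above is simple to evaluate. The sketch below assumes the count is the product of $\mu^* - 1$ factors $\lfloor n\vartheta (1/(m-1) + 1/(\mu^*-1)) \rfloor$ and $m - \mu^*$ factors $\lfloor n\vartheta/(m-1) \rfloor$, with the first group dropped when $\mu^* = 1$; the function name is hypothetical, and the conditions on $\vartheta$ stated in the proof are not checked.

```python
import math

def lattice_count_lower_bound(n, theta, m, mu_star):
    """Lower bound on the number of vectors of F_n inside the parallelepiped P.

    Assumes m >= 2 and 1 <= mu_star <= m; theta is the sup-norm tolerance.
    """
    short_side = math.floor(n * theta / (m - 1))
    if mu_star == 1:
        # Only the m - 1 short sides are present.
        return short_side ** (m - 1)
    long_side = math.floor(n * theta * (1 / (m - 1) + 1 / (mu_star - 1)))
    return long_side ** (mu_star - 1) * short_side ** (m - mu_star)

# theta = 1/16 is exactly representable, so the floors are not affected by rounding.
for mu_star in (1, 2, 3):
    print(mu_star, lattice_count_lower_bound(100, 0.0625, 3, mu_star))
```

Floating-point rounding can shift a floor by one when $n\vartheta$ is close to an integer; exact rational arithmetic (e.g. `fractions.Fraction`) would avoid this.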

Proof of Lemma 3.2

Let $0 < \alpha < 1$ be some constant whose purpose is explained later, in the proof of Theorem 3.1. We begin by deriving a lower bound on the size of $A_n(\delta, \alpha\eta)$, a subset of $A_n(\delta, \eta)$. Consider the set $\mathcal{A} = \{ f \in F_n \mid \|f - \varphi^*\|_\infty \le \vartheta_0 \}$. Since $\vartheta_0 \le \vartheta_\infty$, Proposition 2.2 implies that any $f$ in this set also belongs to $C(\delta)$. Further, by the middle expression in the definition of $\vartheta_0$, Proposition 2.4 implies that any such $f$ also has entropy at least $(1 - \alpha\eta) H^*$. Thus all $f$ in the set $\mathcal{A}$ belong to $A_n(\delta, \alpha\eta)$. Finally, $\vartheta_0$ satisfies the conditions of Proposition 3.2, hence the size of $\mathcal{A}$ is bounded from below by $\Lambda(n, \vartheta_0, \mu^*)$. But $\mathcal{A} \subseteq A_n(\delta, \alpha\eta) \subseteq A_n(\delta, \eta)$, so we established the first claim of the lemma.

Now suppose that all $f$ in $A_n(\delta, \alpha\eta)$ have at least $\mu \ge 1$ non-zero elements; for the purposes of this proof we may take these to be the first $\mu$ elements. Then by Proposition 3.1, if $g$ is an arbitrary element of $A_n(\delta, \alpha\eta)$,

$$\#A_n(\delta, \alpha\eta) \;\ge\; |A_n(\delta, \alpha\eta)|\; e^{n (1 - \alpha\eta) H^*}\, e^{-\frac{1}{12} \left( \frac{1}{\nu_1} + \cdots + \frac{1}{\nu_\mu} \right)}\, \frac{1}{(2\pi n)^{(\mu - 1)/2}}\, \frac{1}{\sqrt{g_1 \cdots g_\mu}}. \tag{A.9}$$

Let $\nu = (n g_1, \ldots, n g_\mu)$; this vector has integral entries, all positive, and summing to $n$. The maximum of $1/\nu_1 + \cdots + 1/\nu_\mu$ equals $\mu - 1 + 1/(n - \mu + 1)$, occurring when $\nu_1 = \cdots = \nu_{\mu - 1} = 1$ and $\nu_\mu = n - \mu + 1$. Thus the second exponential in (A.9) is at least $e^{-\mu/12}$. Further, the maximum of $\sqrt{g_1 \cdots g_\mu}$ subject to $g_1 + \cdots + g_\mu = 1$ occurs at $g_1 = \cdots = g_\mu = 1/\mu$, so the last factor in (A.9) is at least $\mu^{\mu/2}$. Finally, $A_n(\delta, \eta) \supseteq A_n(\delta, \alpha\eta)$, and so (A.9) implies the second result of the lemma, but with the number $\mu$ still undetermined. By requiring $\|f - \varphi^*\|_\infty$ to be less than the smallest non-zero element of $\varphi^*$, we can ensure that there is no element of $f$ which is 0 while the corresponding element of $\varphi^*$ is positive; this is accomplished by the last term on the r.h.s. of (3.6). Thus we can take $\mu$ equal to $\mu^*$, the number of non-zero elements of $\varphi^*$.

Proof of Theorem 3.1

The upper bound on $\#B_n$ and the lower bound on $\#A_n$ are given by Lemmas 3.1 and 3.2. Both these bounds increase when the entropy tolerance $\eta$ decreases towards 0, as makes sense. To simplify the proof we assume that $\Lambda(n, \vartheta_0, \mu^*) \ge 1$. Then, combining the two bounds and unifying some numerical constants,

$$\frac{\#A_n(\delta, \eta)}{\#B_n(\delta, \eta)} \;\ge\; \frac{1}{0.6^{m}} \left( \frac{\mu^*}{2\pi} \right)^{\mu^*/2} e^{-\mu^*/12}\, n^{-(m + \mu^*)/2 + 1}\, e^{(1 - \alpha) \eta H^* n}, \qquad n \ge \frac{m - 1}{[[2, \mu^* = m]]\, \vartheta_0}. \tag{A.10}$$
When everything else is fixed, this lower bound on $\#A_n / \#B_n$ (eventually) increases as $n \to \infty$, as we want it to. In general, this behavior would have been impossible if $\alpha$ were 1. This is why we introduced $\alpha$ and required it to be $< 1$: it serves to strictly separate our bounds on $\#A_n$ and $\#B_n$. There is freedom in choosing the value of $\alpha$, which we exploit below.

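To establish (3.5) we need the l.h.s. of (A.10) to be $\ge 1/\varepsilon + 1$. This reduces to requiring

$$n \ge C_1 \ln n + C_2, \qquad n \ge \frac{m - 1}{[[2, \mu^* = m]]\, \vartheta_0}, \tag{A.11}$$

where the constants $C_1, C_2$ are

$$C_1 = \frac{0.5 (m + \mu^*) - 1}{(1 - \alpha) \eta H^*}, \qquad C_2 = \frac{m \ln 0.6 + (0.5 \ln 2\pi + 1/12 - 0.5 \ln \mu^*)\, \mu^* + \ln(1/\varepsilon + 1)}{(1 - \alpha) \eta H^*}. \tag{A.12}$$

We will now show that (A.11) is satisfied by

$$n \ge N(\alpha) = \begin{cases} 1.5\, C_1 \ln(C_1 + C_2) + C_2, & \text{if } C_2 > 0 \text{ and } C_1 + C_2 \ge 21, \\ 1.5\, C_1 \ln C_1 + C_2, & \text{if } C_2 \le 0. \end{cases} \tag{A.13}$$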

First, assume $C_2 > 0$. Setting $n = 1.5\, C_1 \ln(C_1 + C_2) + C_2$ in (A.11) with $k = 1$ we reduce to

$$\sqrt{C_1 + C_2} \;\ge\; 1.5\, \frac{C_1}{C_1 + C_2} \ln(C_1 + C_2) + \frac{C_2}{C_1 + C_2}.$$

The r.h.s. of this condition is a convex combination of $1.5 \ln(C_1 + C_2)$ and 1, so the first condition will hold if $\sqrt{C_1 + C_2} \ge 1.5 \ln(C_1 + C_2)$, which is true when $C_1 + C_2 \ge 21$.
Now let $C_2 \le 0$. Putting $n = 1.5\, C_1 \ln C_1 + C_2$ in (A.11), we reduce to establishing $C_1^{1.5} \ge 1.5 \ln C_1 + C_2$, which will hold if $C_1^{1.5} \ge \ln C_1^{1.5}$, always true.

The r.h.s. of (A.13) depends on $\alpha \in (0, 1)$, which has up to this point been left unspecified. We finally need $n \ge N(\alpha)$ and $n \ge (m - 1) / ([[2, \mu^* = m]]\, \vartheta_0)$, and we observe that as $\alpha$ increases, the first of these bounds increases while the second decreases. Further, the first bound is finite at 0 and infinite at 1, whereas the second is infinite at 0 and finite at 1. Thus there is an optimal $\alpha$ which makes the two bounds equal, the $\hat\alpha$ which solves $N(\alpha) = (m - 1) / ([[2, \mu^* = m]]\, \vartheta_0(\alpha))$.
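The first case of the threshold, $N = 1.5\, C_1 \ln(C_1 + C_2) + C_2$ for $C_2 > 0$ and $C_1 + C_2 \ge 21$, can be checked against the requirement $n \ge C_1 \ln n + C_2$ directly. The sketch below treats $C_1$ and $C_2$ as given positive numbers rather than computing them from $m$, $\mu^*$, $\eta$, $H^*$, $\alpha$ and $\varepsilon$; the function names are illustrative.

```python
import math

def n_threshold(c1, c2):
    """N for the case C2 > 0 and C1 + C2 >= 21 (the only case sketched here)."""
    if c1 <= 0 or c2 <= 0 or c1 + c2 < 21:
        raise ValueError("sketch covers only C1 > 0, C2 > 0, C1 + C2 >= 21")
    return 1.5 * c1 * math.log(c1 + c2) + c2

def satisfies_requirement(n, c1, c2):
    """The first condition of the requirement: n >= C1 * ln(n) + C2."""
    return n >= c1 * math.log(n) + c2

# A few sample constants; n - C1 * ln(n) is increasing for n > C1,
# so any n above the threshold also satisfies the requirement.
for c1, c2 in ((20.0, 1.0), (30.0, 5.0), (1.0, 20.0), (1e4, 500.0)):
    n = n_threshold(c1, c2)
    print(c1, c2, n, satisfies_requirement(n, c1, c2))
```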
Proof of Proposition A.1

It seems that the inequality

$$\sum_{\substack{\nu_1 + \cdots + \nu_\mu = n \\ \nu_1, \ldots, \nu_\mu \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_\mu}} \;\le\; \int_{x_1 + \cdots + x_\mu = n} \frac{dx}{\sqrt{x_1 \cdots x_\mu}}$$

should hold (see [GR80], 4.635, #4). Being unable to show this directly, we go through a more circuitous and lengthier proof.

Consider the simplest case $\mu = 2$ first. We can bound the sum as follows:

$$\sum_{\substack{\nu_1 + \nu_2 = n \\ \nu_1, \nu_2 \ge 1}} \frac{1}{\sqrt{\nu_1 \nu_2}} = \sum_{\nu = 1}^{n - 1} \frac{1}{\sqrt{\nu (n - \nu)}} < \int_0^n \frac{dx}{\sqrt{x (n - x)}} = \pi. \tag{A.14}$$

To see this, note that $\sum_{\nu = 1}^{\lfloor n/2 \rfloor} 1/\sqrt{\nu (n - \nu)} < \int_0^{n/2} dx / \sqrt{x (n - x)} = \pi/2$ because the sum is a lower Riemann sum for the integral. Since the summand is symmetric about $n/2$, doubling this produces the desired result.

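Now consider the case of even $\mu$, i.e. $\mu = 2\lambda$. Divide the $\nu_i$ into $\lambda$ pairs, each of which sums to some number $\ge 2$, and these numbers in turn sum to $n$:

$$\sum_{\substack{\nu_1 + \cdots + \nu_{2\lambda} = n \\ \nu_i \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_{2\lambda}}} = \sum_{\substack{k_1 + \cdots + k_\lambda = n \\ k_1, \ldots, k_\lambda \ge 2}} \Biggl( \sum_{\substack{\nu_1 + \nu_2 = k_1 \\ \nu_1, \nu_2 \ge 1}} \frac{1}{\sqrt{\nu_1 \nu_2}} \Biggr) \cdots \Biggl( \sum_{\substack{\nu_{2\lambda - 1} + \nu_{2\lambda} = k_\lambda \\ \nu_{2\lambda - 1}, \nu_{2\lambda} \ge 1}} \frac{1}{\sqrt{\nu_{2\lambda - 1} \nu_{2\lambda}}} \Biggr) < \pi^\lambda \sum_{\substack{k_1 + \cdots + k_\lambda = n \\ k_1, \ldots, k_\lambda \ge 2}} 1. \tag{A.15}$$

Here the inequality follows by applying (A.14), which does not depend on $n$, to each of the inner sums. Further,

$$\sum_{\substack{k_1 + \cdots + k_\lambda = n \\ k_1, \ldots, k_\lambda \ge 2}} 1 = \sum_{\substack{k'_1 + \cdots + k'_\lambda = n - 2\lambda \\ k'_1, \ldots, k'_\lambda \ge 0}} 1 = \binom{n - \lambda - 1}{\lambda - 1},$$

where in the first equality we assume w.l.o.g. that $2\lambda < n$, and the 2nd equality follows from the fact that the number of compositions of $N$ into $M$ parts (i.e. the solutions of $k_1 + \cdots + k_M = N$, $k_i \ge 0$) is $\binom{N + M - 1}{M - 1}$. Finally we bound the binomial coefficient by $\binom{n - \lambda - 1}{\lambda - 1} \le \frac{n^{\lambda - 1}}{(\lambda - 1)!}$ to arrive at

$$\sum_{\substack{\nu_1 + \cdots + \nu_{2\lambda} = n \\ \nu_i \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_{2\lambda}}} < \frac{\pi^\lambda}{\Gamma(\lambda)}\, n^{\lambda - 1}. \tag{A.16}$$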

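Now we turn to the case of odd $\mu$, i.e. $\mu = 2\lambda + 1$. Similarly to what we did above, with $\nu_{2\lambda + 1} = k_1$,

$$\sum_{\substack{\nu_1 + \cdots + \nu_{2\lambda} + \nu_{2\lambda + 1} = n \\ \nu_i \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_{2\lambda} \nu_{2\lambda + 1}}} = \sum_{\substack{k_1 + k_2 = n \\ k_1 \ge 1,\ k_2 \ge 2\lambda}} \frac{1}{\sqrt{k_1}} \sum_{\substack{\nu_1 + \cdots + \nu_{2\lambda} = k_2 \\ \nu_1, \ldots, \nu_{2\lambda} \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_{2\lambda}}}.$$

By (A.16), the r.h.s. does not exceed

$$\frac{\pi^\lambda}{\Gamma(\lambda)} \sum_{\substack{k_1 + k_2 = n \\ k_1 \ge 1,\ k_2 \ge 2\lambda}} \frac{k_2^{\lambda - 1}}{\sqrt{k_1}} = \frac{\pi^\lambda}{\Gamma(\lambda)} \sum_{k = 1}^{n - 2\lambda} \frac{(n - k)^{\lambda - 1}}{\sqrt{k}}, \tag{A.17}$$

and this last sum can be bounded by the integral

$$\int_0^n \frac{(n - x)^{\lambda - 1}}{\sqrt{x}}\, dx = n^{\lambda - 1/2} \int_0^1 \frac{(1 - x)^{\lambda - 1}}{\sqrt{x}}\, dx = n^{\lambda - 1/2}\, \frac{\Gamma(\lambda)\, \Gamma(1/2)}{\Gamma(\lambda + 1/2)}.$$

We have thus shown that for $\mu = 2\lambda + 1$,

$$\sum_{\substack{\nu_1 + \cdots + \nu_{2\lambda + 1} = n \\ \nu_1, \ldots, \nu_{2\lambda + 1} \ge 1}} \frac{1}{\sqrt{\nu_1 \cdots \nu_{2\lambda + 1}}} < \frac{\pi^{\lambda + 1/2}}{\Gamma(\lambda + 1/2)}\, n^{\lambda - 1/2}. \tag{A.18}$$

Eqs. (A.16) and (A.18) establish the proposition for all $\mu \ge 2$.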
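For small $n$ and $\mu$ the bound can be checked by direct computation. The sketch below assumes the statement being proved is that the sum of $1/\sqrt{\nu_1 \cdots \nu_\mu}$ over compositions of $n$ into $\mu$ positive parts is less than $\pi^{\mu/2} n^{\mu/2 - 1} / \Gamma(\mu/2)$, which is what (A.16) and (A.18) give for even and odd $\mu$. The function names are illustrative, and the sum is evaluated by a simple dynamic program rather than by enumerating compositions.

```python
import math

def inner_sum(n, mu):
    """Sum of 1/sqrt(nu_1 * ... * nu_mu) over nu_1 + ... + nu_mu = n, nu_i >= 1."""
    # table[t] holds the partial sum over the parts chosen so far that total t.
    table = [0.0] * (n + 1)
    table[0] = 1.0
    for _ in range(mu):
        new = [0.0] * (n + 1)
        for total, weight in enumerate(table):
            if weight:
                for v in range(1, n - total + 1):
                    new[total + v] += weight / math.sqrt(v)
        table = new
    return table[n]

def gamma_bound(n, mu):
    """pi^(mu/2) * n^(mu/2 - 1) / Gamma(mu/2)."""
    return math.pi ** (mu / 2) * n ** (mu / 2 - 1) / math.gamma(mu / 2)

for mu in (2, 3, 4, 5):
    print(mu, inner_sum(30, mu), gamma_bound(30, mu))
```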

Proof of Lemma 3.3

First we show that if $n \ge 1/\vartheta_0(\alpha)$ then $f^* \in A_n(\delta, \alpha\eta)$. By Proposition 2.1, $n \ge 1/\vartheta_0(\alpha) \Rightarrow \|f^* - \varphi^*\|_\infty \le \vartheta_\infty$, so by Proposition 2.2 $f^* \in C(\delta)$. Further, $\|f^* - \varphi^*\|_\infty \le \frac{2}{3}\, \frac{\alpha\eta H^*}{m \ln(m / \alpha\eta H^*)}$ means that $H(f^*) \ge (1 - \alpha\eta) H^*$ by Proposition 2.4. Therefore $n \ge 1/\vartheta_0(\alpha)$ implies that $f^*$ belongs to the set $A_n(\delta, \alpha\eta)$ as claimed.

Next we put a lower bound on $\#f^*$. Applying Proposition 3.1 we see that $\#f^*$ is $\ge$ the r.h.s. of (A.9) in the proof of Lemma 3.2 with $|A_n(\delta, \alpha\eta)| = 1$, so $\#f^*$ is $\ge$ the bound of Lemma 3.2 on $\#A_n(\delta, \eta)$ with $\Lambda = 1$. Then from the proof of Theorem 3.1 we see that $\#f^* / \#B_n(\delta, \eta)$ is $\ge$ the r.h.s. of (A.10), but with the condition on $n$ being $n \ge 1/\vartheta_0(\alpha)$. The rest of the proof of Theorem 3.1 then applies, to the point where $n$ has to satisfy $n \ge N(\alpha)$ and $n \ge 1/\vartheta_0(\alpha)$. The value $\hat\alpha$ equalizes these bounds, and the completion of the proof of Theorem 3.1 then establishes that if $n \ge 1/\vartheta_0(\hat\alpha)$, $\#f^* / \#B_n(\delta, \eta) \ge 1/\varepsilon + 1$.

Proof of Proposition 3.3

The function $y \ln(m/y)$, $m \ge 2$, is increasing for $y \in (0, 1/2]$. The first implication in the proposition then follows immediately from Theorem 16.3.2 of [CT91], the $\ell_1$ norm bound on entropy, which states that if two $m$-vectors $p, q$ are s.t. $\|p - q\|_1 \le 1/2$, then $|H(p) - H(q)| \le \|p - q\|_1 \ln(m / \|p - q\|_1)$.

To prove the second implication we use Pinsker's inequality and the "triangle inequality" for cross- or relative entropy, or divergence $D(\cdot \| \cdot)$. Applied to $f$ and $\varphi^*$, Pinsker's inequality states that $D(f \| \varphi^*) \ge \frac{1}{2} \|f - \varphi^*\|_1^2$ (see [CT91], Lemma 12.6.1). Then the triangle inequality, using the uniform distribution as the prior or reference distribution ([CT91], Theorem 12.6.1), implies that $H(\varphi^*) - H(f) \ge D(f \| \varphi^*)$. What we want to prove follows from the above two inequalities.

Pinsker's inequality can be tightened in two ways: [OW05] show that the 1/2 can be replaced by a factor $c(\varphi^*) \ge 1/2$, and [FHT03] give right-hand sides that are polynomials involving powers of the norm beyond the square.
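The two inequalities used in this proof, the $\ell_1$ bound on entropy differences and Pinsker's inequality (both in nats), can be spot-checked on arbitrary probability vectors. This sketch does not construct a maximum-entropy vector $\varphi^*$, so it does not test the "triangle inequality" step; the helper names are hypothetical.

```python
import math
import random

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def divergence(p, q):
    """Relative entropy D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def l1_distance(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def random_distribution(m, rng):
    """A strictly positive random probability vector of length m."""
    w = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(w)
    return [x / total for x in w]

def check_pair(p, q, tol=1e-12):
    """Return (Pinsker holds, l1 entropy bound holds) for the pair p, q."""
    m, d = len(p), l1_distance(p, q)
    pinsker = divergence(p, q) + tol >= 0.5 * d * d
    if 0 < d <= 0.5:
        l1_bound = abs(entropy(p) - entropy(q)) <= d * math.log(m / d) + tol
    else:
        l1_bound = True  # the l1 bound is only stated for distances in (0, 1/2]
    return pinsker, l1_bound

rng = random.Random(0)
p = random_distribution(5, rng)
r = random_distribution(5, rng)
q = [0.8 * a + 0.2 * b for a, b in zip(p, r)]  # keeps the l1 distance at most 0.4
print(check_pair(p, q))
```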

Proof of Lemma 3.4

Entirely analogous to that of Lemma 3.1, except that the set $B'_n$ is defined by (3.8) instead of (3.3), and the factor $e^{n (1 - \eta) H^*}$ in (A.3), coming from the upper bound on $\#f$ of Proposition 3.1, is replaced by the factor $e^{n (H^* - \vartheta^2/2)}$ of Proposition 3.3.

Proof of Lemma 3.5

The proof follows that of Lemma 3.2: first we lower-bound the size of $A'_n(\delta, \alpha\vartheta)$ and then the entropy of the $f$ in it. The basic difference is that here we have $\ell_1$ norms. If $\|f - \varphi^*\|_1 \le \vartheta_0$, so is $\|f - \varphi^*\|_\infty$, and then Proposition 3.2 says that the size of $A'_n(\delta, \vartheta_0)$ is at least $\Lambda(n, \vartheta_0, \mu^*)$. Second, concerning the entropy of $f \in A'_n(\delta, \vartheta_0)$, by Proposition 3.3 $\|f - \varphi^*\|_1 \le \vartheta_0$ implies that $H(f)$ is at least $H^* - h(\alpha\vartheta)$. The proof then follows that of Lemma 3.2, except that the term $e^{n (1 - \alpha\eta) H^*}$ in (A.9) is replaced by $e^{n (H^* - h(\alpha\vartheta))}$.
Proof of Proposition 3.4

$\partial\psi / \partial\alpha$ is always negative, and $\psi(\vartheta^2/2, \vartheta) > 0$ if $m < \frac{1}{2} \vartheta^3 e^{1/\vartheta}$. This establishes the first part. For the second part, we note, in addition, that $\psi(1, \vartheta) < 0$ even for $m = 2$.

Proof of Theorem 3.3

The proof uses Lemmas 3.4 and 3.5 and is completely analogous to that of Theorem 3.1. The main feature is that $H^*$ falls out of the new (A.10), the exponential is $e^{n \psi(\alpha, \vartheta)}$ with $\psi(\alpha, \vartheta) = \vartheta^2/2 - h(\alpha\vartheta)$, and the condition on $n$ is now $n \ge (m - 1) / ([[2, \mu^* = m]]\, \vartheta_0)$. $C_1, C_2$ are the same as in Theorem 3.1 except for the denominators. Finally, $N(\alpha)$ is finite at $\alpha = 0$ and increases to $\infty$ at $\alpha = \alpha_0$, whereas $1/\vartheta_0(\alpha)$ is infinite at $\alpha = 0$ and decreases to a finite value at $\alpha = \alpha_0$. Thus the equation $N(\alpha) = (m - 1) / ([[2, \mu^* = m]]\, \vartheta_0(\alpha))$ has a root $\hat\alpha$ between 0 and $\alpha_0$, which equates the two sides and is therefore the optimal $\alpha$.

Proof of Lemma 3.6

The proof is analogous to that of Lemma 3.3. First, by Proposition 2.1, $n \ge 1/\vartheta_0(\alpha)$ implies that $f^* \in C(\delta)$. Second, if $n \ge 3\mu^* / (4 \vartheta_0(\alpha))$ then $\|f^* - \varphi^*\|_1 \le \alpha\vartheta$ by the 2nd claim of Proposition 2.1. Hence if $n \ge 3\mu^* / (4 \vartheta_0(\alpha))$, $f^*$ belongs to the set $A'_n(\delta, \alpha\vartheta)$. Next, by the argument in the proof of Lemma 3.3, $\#f^*$ can be lower-bounded by the bound of Lemma 3.5 with $\Lambda = 1$. So $\#f^* / \#B'_n(\delta, \vartheta)$ is lower-bounded by the new (A.10) as in the proof of Theorem 3.3, but the condition on $n$ is now $n \ge 3\mu^* / (4 \vartheta_0(\alpha))$. The rest follows as in the proof of Theorem 3.3.

Proof of Corollary 3.1

In this case there are no constraints, so by Proposition 2.2 $\vartheta_\infty = \infty$. Also, $\mu^* = m$ and $\varphi^*_{\min} = 1/m$. Further, if $\vartheta < 1/m$, then $1/(\alpha\vartheta) > m$, so the condition of Corollary 3.1 on $n$ is $n \ge 3m / (4\alpha\vartheta)$. The conditions $\vartheta < 1/m$ and $m < \frac{1}{2} \vartheta^3 e^{1/\vartheta}$ are satisfied if $\vartheta \le \min(0.09, 1/m)$. Finally, $\vartheta_0(\alpha) = \alpha\vartheta$.